String Conversion - remove some characters and replace non-digits with ASCII code

I need to take the value CS5999-1 and convert it to 678359991. Basically replace any alpha character with the equivalent ASCII value and strip the dash. I need to get rid of non-numeric characters and make the value unique (some of the data coming in is all numeric and I determined this will make the records unique).
I have played around with regular expressions and can replace the characters with an empty string, but can't figure out how to replace the character with an ASCII value.
Code is still stuck in .NET 2.0 (Corporate America) in case that matters for any ideas.
I have tried several different methods to do this and no I don't expect SO members to write the code for me. I am looking for ideas.
To replace the alpha characters with an empty string I have used:
strResults = Regex.Replace(strResults, @"[A-Za-z\s]", string.Empty);
This loop replaces each character with itself. If I could find a way to substitute the ASCII value as the replacement value I would have it, but I have tried converting the char value to int and several other methods I found, and they all come up with an error.
foreach (char c in strMapResults)
{
strMapResults = strMapResults.Replace(c,c);
}

Check if each character is in the a-z range. If so, add the ASCII value to the list, and if it is in the 0-9 range, just add the number.
public static string AlphaToAscii(string str)
{
var result = string.Empty;
foreach (char c in str)
{
if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
result += (int)c;
else if (c >= '0' && c <= '9')
result += c;
}
return result;
}
All characters outside of the alpha-numeric range (such as -) will be ignored.
If you are running this function on particularly large strings or want better performance you may want to use a StringBuilder instead of +=.

For all characters in the ASCII range, the encoded value is the same as the Unicode code point. This is also true of ISO/IEC 8859-1, and UCS-2, but not of other legacy encodings.
And since UCS-2 is the same as UTF-16 for the values in UCS-2 (which includes all ASCII characters, as per the above), and since .NET char is a UTF-16 unit, all you need to do is just cast to int.
var builder = new StringBuilder(str.Length * 3); // Pre-allocate for the worst-case scenario
foreach(char c in str)
{
if (c >= '0' && c <= '9')
builder.Append(c);
else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
builder.Append((int)c);
}
string result = builder.ToString();
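As a minimal sketch, the loop above can be wrapped in a self-contained method. The class and method names (AsciiConverter, AlphaToAscii) are illustrative assumptions, and the code avoids var and lambdas so that it should also compile with the C# 2.0 compiler mentioned in the question.

```csharp
using System;
using System.Text;

public static class AsciiConverter
{
    // Letters A-Z/a-z become their decimal ASCII code, digits are kept,
    // and every other character (such as '-') is dropped.
    public static string AlphaToAscii(string str)
    {
        // Each input char yields at most three output chars ('z' -> "122").
        StringBuilder builder = new StringBuilder(str.Length * 3);
        foreach (char c in str)
        {
            if (c >= '0' && c <= '9')
                builder.Append(c);
            else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
                builder.Append((int)c);
        }
        return builder.ToString();
    }
}
```

Calling AsciiConverter.AlphaToAscii("CS5999-1") returns "678359991". One thing to keep in mind if the result is used as a unique key: the mapping can collide, since "A5" and the all-numeric "655" both produce "655".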

If you want to know how you might do this with a regular expression (you mentioned regex in your question), here's one way to do it.
The code below filters all non-digit characters, converting letters to their ASCII representation, and dumping anything else, including all non-ASCII alphabetical characters. Note that treating (int)char as the equivalent of a character's ASCII value is only valid where the character is genuinely available in the ASCII character set, which is clearly the case for A-Za-z.
MatchEvaluator filter = match =>
{
var alpha = match.Groups["asciialpha"].Value;
return alpha != "" ? ((int) alpha[0]).ToString() : "";
};
var filtered = Regex.Replace("CS5999-1", @"(?<asciialpha>[A-Za-z])|\D", filter);

Try this
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
string input = "CS5999-1";
MatchEvaluator evaluator = new MatchEvaluator(Replace);
string results = Regex.Replace(input, "[A-Za-z\\-]", evaluator);
}
static string Replace(Match match)
{
if (match.Value == "-")
{
return "";
}
else
{
byte[] ascii = Encoding.UTF8.GetBytes(match.Value);
return ascii[0].ToString();
}
}
}
}

Related

c# add comma before every numbers in my string except first number

I am developing an application in asp.net mvc.
I have a string like the one below:
string myString = "1A5#3a2#";
Now I want to add a comma before every occurrence of a number in my string except the first occurrence, like this:
string myNewString = "1A,5#,3a,2#";
I know I can use a loop for this, like below:
string myNewString = "";
foreach(var ch in myString)
{
if (ch >= '0' && ch <= '9')
{
myNewString = myNewString == "" ? Convert.ToString(ch) : myNewString + "," + Convert.ToString(ch);
}
else
{
myNewString = myNewString ==""? Convert.ToString(ch): myNewString + Convert.ToString(ch);
}
}

You could use this StringBuilder approach:
public static string InsertBeforeEveryDigit(string input, char insert)
{
StringBuilder sb = new(input);
for (int i = sb.Length - 2; i >= 0; i--)
{
if (!char.IsDigit(sb[i]) && char.IsDigit(sb[i+1]))
{
sb.Insert(i+1, insert);
}
}
return sb.ToString();
}
Console.Write(InsertBeforeEveryDigit("1A5#3a2#", ',')); // 1A,5#,3a,2#
Update: This answer gives a different result than the one from TWM if the string contains consecutive digits, as in "12A5#3a2#".
My answer gives: 12A,5#,3a,2#
TWM's gives: 1,2A,5#,3a,2#
Not sure which is desired.

As I understand it, the code below will work for you:
StringBuilder myNewStringBuilder = new StringBuilder();
foreach(var ch in myString)
{
if (ch >= '0' && ch <= '9')
{
if (myNewStringBuilder.Length > 0)
{
myNewStringBuilder.Append(",");
}
myNewStringBuilder.Append(ch);
}
else
{
myNewStringBuilder.Append(ch);
}
}
myString = myNewStringBuilder.ToString();
NOTE
Instead of using myNewString variable, I've used StringBuilder object to build up the new string. This is more efficient than concatenating strings, as concatenating strings creates new strings and discards the old ones. The StringBuilder object avoids this by efficiently storing the string in a mutable buffer, reducing the number of object allocations and garbage collections.

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. 😊
string myString = "1A5#3a2#";
var result = Regex.Replace(myString, @"(?<=\d\D*)\d\D*", ",$&");
Regex explanation (@regex101):
\d\D* - matches every occurrence of a digit with any following non-digits (zero+)
(?<=\d\D*) - positive lookbehind, so there must be at least one group with a digit before the match (i.e. the first one is ignored)
This can be updated if you need to handle consecutive digits (i.e. "1a55b" -> "1a,55b") by changing \d to \d+:
var result = Regex.Replace(myString, @"(?<=\d+\D*)\d+\D*", ",$&");
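One caveat with the \d+ variant: when the string starts with a multi-digit number, the lookbehind is already satisfied by the first digit alone, so "12a3" becomes "1,2a,3". A possible alternative, sketched below, requires at least one non-digit between the previous digit and the current one. The class and method names (CommaInserter, InsertCommas) are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommaInserter
{
    // Inserts a comma before the first digit of every number except the
    // first number in the string.
    public static string InsertCommas(string input)
    {
        // (?<=\d\D+) requires a digit followed by one or more non-digits
        // immediately before the current digit, so only a digit that starts
        // a new number (and is not part of the first number) matches.
        return Regex.Replace(input, @"(?<=\d\D+)\d", ",$&");
    }
}
```

With this pattern, "1A5#3a2#" gives "1A,5#,3a,2#", "12A5#3a2#" gives "12A,5#,3a,2#", and "1a55b" gives "1a,55b".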

Why Char.IsDigit returns true for chars which can't be parsed to int?

I often use Char.IsDigit to check if a char is a digit which is especially handy in LINQ queries to pre-check int.Parse as here: "123".All(Char.IsDigit).
But there are chars which are digits but which can't be parsed to int like ۵.
// true
bool isDigit = Char.IsDigit('۵');
var cultures = CultureInfo.GetCultures(CultureTypes.SpecificCultures);
int num;
// false
bool isIntForAnyCulture = cultures
.Any(c => int.TryParse('۵'.ToString(), NumberStyles.Any, c, out num));
Why is that? Is my int.Parse-precheck via Char.IsDigit thus incorrect?
There are 310 chars which are digits:
List<char> digitList = Enumerable.Range(0, UInt16.MaxValue)
.Select(i => Convert.ToChar(i))
.Where(c => Char.IsDigit(c))
.ToList();
Here's the implementation of Char.IsDigit in .NET 4 (ILSpy):
public static bool IsDigit(char c)
{
if (char.IsLatin1(c))
{
return c >= '0' && c <= '9';
}
return CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.DecimalDigitNumber;
}
So why are there chars that belong to the DecimalDigitNumber-category("Decimal digit character, that is, a character in the range 0 through 9...") which can't be parsed to an int in any culture?

It's because it is checking for all digits in the Unicode "Number, Decimal Digit" category, as listed here:
http://www.fileformat.info/info/unicode/category/Nd/list.htm
It doesn't mean that it is a valid numeric character in the current locale. In fact using int.Parse(), you can ONLY parse the normal English digits, regardless of the locale setting.
For example, this doesn't work:
int test = int.Parse("٣", CultureInfo.GetCultureInfo("ar"));
Even though ٣ is a valid Arabic digit character, and "ar" is the Arabic locale identifier.
The Microsoft article "How to: Parse Unicode Digits" states that:
The only Unicode digits that the .NET Framework parses as decimals are the ASCII digits 0 through 9, specified by the code values U+0030 through U+0039. The .NET Framework parses all other Unicode digits as characters.
However, note that you can use char.GetNumericValue() to convert a Unicode numeric character to its numeric equivalent as a double.
The reason the return value is a double and not an int is because of things like this:
Console.WriteLine(char.GetNumericValue('¼')); // Prints 0.25
You could use something like this to convert all numeric characters in a string into their ASCII equivalent:
public string ConvertNumericChars(string input)
{
StringBuilder output = new StringBuilder();
foreach (char ch in input)
{
if (char.IsDigit(ch))
{
double value = char.GetNumericValue(ch);
if ((value >= 0) && (value <= 9) && (value == (int)value))
{
output.Append((char)('0'+(int)value));
continue;
}
}
output.Append(ch);
}
return output.ToString();
}

Decimal digits are 0 to 9, but they have many representations in Unicode. From Wikipedia:
The decimal digits are repeated in 23 separate blocks
MSDN specifies that .NET only parses Latin numerals:
However, the only numeric digits recognized by parsing methods are the basic Latin digits 0-9 with code points from U+0030 to U+0039
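Given that, a pre-check intended to mirror what int.Parse accepts could test for the ASCII range directly instead of calling Char.IsDigit. The following is a minimal sketch; the names AsciiDigits, IsAsciiDigit and IsAllAsciiDigits are illustrative assumptions.

```csharp
using System;

public static class AsciiDigits
{
    // True only for the basic Latin digits U+0030 through U+0039.
    public static bool IsAsciiDigit(char c)
    {
        return c >= '0' && c <= '9';
    }

    // True only when s is non-empty and consists solely of '0'..'9'.
    public static bool IsAllAsciiDigits(string s)
    {
        if (string.IsNullOrEmpty(s))
            return false;
        foreach (char c in s)
        {
            if (!IsAsciiDigit(c))
                return false;
        }
        return true;
    }
}
```

Even with this check, a long digit string can still overflow an int, so int.TryParse remains the safer way to decide whether a string is parseable.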

Regex to find out if the sequence has any special characters

I am looking for a regex to find out whether a given word sequence contains any special characters.
For example, in this input string:
"test?test";
I would like to detect words of the form
"test(any special char(s) including space)test"

You can just use [^A-Za-z0-9], which will match anything that is not alphanumeric, but of course it depends on what you consider a "special character." If underscore is not special, \W can be a shortcut for anything that is not a word character (A-Za-z0-9_). Note that in .NET, \w also matches non-ASCII letters and digits unless RegexOptions.ECMAScript is specified.
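A minimal sketch of how that character class might be used, assuming "special" means anything other than an ASCII letter or digit. The names SpecialCharCheck and HasSpecialChar are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpecialCharCheck
{
    // True if the input contains at least one character that is not
    // an ASCII letter or digit (spaces and punctuation count as special).
    public static bool HasSpecialChar(string input)
    {
        return Regex.IsMatch(input, "[^A-Za-z0-9]");
    }
}
```

For example, HasSpecialChar("test?test") and HasSpecialChar("test test") are both true, while HasSpecialChar("testtest") is false.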

You don't really need a regex here. If you want to test for alphanumeric characters, you can use LINQ, for example (or just iterate over the letters):
string input = "test test";
bool valid = input.All(Char.IsLetterOrDigit);
Char.IsLetterOrDigit checks for all Unicode alphanumeric characters. If you only want the English ones, you can write:
public static bool IsEnglishAlphanumeric(char c)
{
return ((c >= 'a') && (c <= 'z'))
|| ((c >= 'A') && (c <= 'Z'))
|| ((c >= '0') && (c <= '9'));
}
and use it similarly:
bool valid = input.All(IsEnglishAlphanumeric);

How do I get a list of all the printable characters in C#?

I'd like to be able to get a char array of all the printable characters in C#, does anybody know how to do this?
edit:
By printable I mean the visible European characters, so yes, umlauts, tildes, accents etc.

This will give you a list with all characters that are not considered control characters:
List<Char> printableChars = new List<char>();
for (int i = char.MinValue; i <= char.MaxValue; i++)
{
char c = Convert.ToChar(i);
if (!char.IsControl(c))
{
printableChars.Add(c);
}
}
You may want to investigate the other Char.IsXxxx methods to find a combination that suits your requirements.
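As one possible combination, the sketch below restricts the scan to the Latin-1 range (U+0000 to U+00FF) and filters by Unicode category. Treating Latin-1 as a stand-in for "visible European characters" is an assumption, as are the names PrintableChars and GetLatin1Printable; widen the range or the excluded categories to suit your needs.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class PrintableChars
{
    // Returns every char in U+0000..U+00FF that is not a control character,
    // an unassigned code point, or a surrogate.
    public static char[] GetLatin1Printable()
    {
        List<char> result = new List<char>();
        for (int i = 0; i <= 0xFF; i++)
        {
            char c = (char)i;
            UnicodeCategory category = char.GetUnicodeCategory(c);
            if (category != UnicodeCategory.Control &&
                category != UnicodeCategory.OtherNotAssigned &&
                category != UnicodeCategory.Surrogate)
            {
                result.Add(c);
            }
        }
        return result.ToArray();
    }
}
```

The result still contains the space, the no-break space and the soft hyphen, because those are separators or format characters rather than control characters; exclude them as well if "printable" should mean "leaves a visible mark".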

Here's a LINQ version of Fredrik's solution. Note that Enumerable.Range yields an IEnumerable<int> so you have to convert to chars first. Cast<char> would have worked in 3.5SP0 I believe, but as of 3.5SP1 you have to do a "proper" conversion:
var chars = Enumerable.Range(0, char.MaxValue+1)
.Select(i => (char) i)
.Where(c => !char.IsControl(c))
.ToArray();
I've created the result as an array as that's what the question asked for - it's not necessarily the best idea though. It depends on the use case.
Note that this also doesn't consider full Unicode characters, only those in the basic multilingual plane. I don't know what it returns for high/low surrogates, but it's worth at least knowing that a single char doesn't really let you represent everything :(

A LINQ solution (based on Fredrik Mörk's):
Enumerable.Range(char.MinValue, char.MaxValue + 1).Select(c => (char)c).Where(
c => !char.IsControl(c)).ToArray();

TLDR Answer
Use this Regex...
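var regex = new Regex(@"[^\p{Cc}\p{Cn}\p{Cs}]");
TLDR Explanation
The leading ^ negates the whole character class, so the regex matches any single character that is none of the following:
\p{Cc} : control characters.
\p{Cn} : unassigned characters.
\p{Cs} : surrogate code units, which are not valid characters on their own.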
Working Demo
I test two strings in this demo: "Hello, World!" and "Hello, World!" + (char)4. (char)4 is the END OF TRANSMISSION control character.
using System;
using System.Text.RegularExpressions;
public class Test {
public static MatchCollection getPrintableChars(string haystack) {
var regex = new Regex(@"[^\p{Cc}\p{Cn}\p{Cs}]");
var matches = regex.Matches(haystack);
return matches;
}
public static void Main() {
var teststring1 = "Hello, World!";
var teststring2 = "Hello, World!" + (char)4;
var teststring1unprintablechars = getPrintableChars(teststring1);
var teststring2unprintablechars = getPrintableChars(teststring2);
Console.WriteLine("Testing a Printable String: " + teststring1unprintablechars.Count + " Printable Chars Detected");
Console.WriteLine("Testing a String With 1-Unprintable Char: " + teststring2unprintablechars.Count + " Printable Chars Detected");
foreach (Match unprintablechar in teststring1unprintablechars) {
Console.WriteLine("String 1 Printable Char:" + unprintablechar);
}
foreach (Match unprintablechar in teststring2unprintablechars) {
Console.WriteLine("String 2 Printable Char:" + unprintablechar);
}
}
}
Full Working Demo at IDEOne.com
Alternatives
\P{C} : Match only visible characters. Do not match any invisible characters.
\P{Cc} : Match only non-control characters. Do not match any control characters.
[^\p{Cc}\p{Cn}] : Match only non-control characters that have been assigned. Do not match any control or unassigned characters.
[^\p{Cc}\p{Cn}\p{Cs}] : Match only non-control characters that have been assigned and are not surrogates. Do not match any control, unassigned, or surrogate characters.
[^\p{Cc}\p{Cn}\p{Cs}\p{Cf}] : Match only non-control, non-formatting characters that have been assigned and are not surrogates. Do not match any control, unassigned, formatting, or surrogate characters.
Source and Explanation
Take a look at the Unicode Character Properties available that can be used to test within a regex. You should be able to use these regexes in Microsoft .NET, JavaScript, Python, Java, PHP, Ruby, Perl, Golang, and even Adobe. Knowing Unicode character classes is very transferable knowledge, so I recommend using it!

I know ASCII wasn't specifically requested but this is a quick way to get a list of all the printable ASCII characters.
for (Int32 i = 0x20; i <= 0x7e; i++)
{
printableChars.Add(Convert.ToChar(i));
}
See this ASCII table.
Edit:
As stated by Péter Szilvási, the 0x20 and 0x7e in the loop are hexadecimal representations of the base 10 numbers 32 and 126, which are the first and last printable ASCII characters.

public bool IsPrintableASCII(char c)
{
return c >= '\x20' && c <= '\x7e';
}

what's the quickest way to extract a 5 digit number from a string in c#

what's the quickest way to extract a 5 digit number from a string in c#.
I've got
string.Join(null, System.Text.RegularExpressions.Regex.Split(expression, "[^\\d]"));
Any others?

The regex approach is probably the quickest to implement but not the quickest to run. I compared a simple regex solution to the following manual search code and found that the manual search code is ~2x-2.5x faster for large input strings and up to 4x faster for small strings:
static string Search(string expression)
{
    int run = 0; // length of the current run of consecutive digits
    for (int i = 0; i < expression.Length; i++)
    {
        char c = expression[i];
        if (Char.IsDigit(c))
            run++;
        else if (run == 5)
            return expression.Substring(i - run, run);
        else
            run = 0;
    }
    // A 5-digit run may also end exactly at the end of the string.
    if (run == 5)
        return expression.Substring(expression.Length - run);
    return null;
}
const string pattern = @"\d{5}";
static string NotCached(string expression)
{
    return Regex.Match(expression, pattern, RegexOptions.Compiled).Value;
}
static Regex regex = new Regex(pattern, RegexOptions.Compiled);
static string Cached(string expression)
{
    return regex.Match(expression).Value;
}
Results for a ~50-char string with a 5-digit string in the middle, over 10^6 iterations, latency per call in microseconds (smaller number is faster):
Simple search: 0.648396us
Cached Regex: 2.1414645us
Non-cached Regex: 3.070116us
Results for a ~40K string with a 5-digit string in the middle over 10^4 iterations, latency per call in microseconds (smaller number is faster):
Simple search: 423.801us
Cached Regex: 1155.3948us
Non-cached Regex: 1220.625us
A little surprising: I would have expected Regex -- which is compiled to IL -- to be comparable to the manual search, at least for very large strings.

Use a regular expression (\d{5}) to find the occurrence(s) of the 5-digit number in the string and use int.Parse or decimal.Parse on the match(es).
In the case where there is only one number in the text:
int? value = null;
string pat = @"\d{5}";
Regex r = new Regex(pat);
Match m = r.Match(text);
if (m.Success)
{
value = int.Parse(m.Value);
}

Do you mean convert a string to a number? Or find the first 5 digit string and then make it a number? Either way, you'll probably be using decimal.Parse or int.Parse.
I'm of the opinion that Regular Expressions are the wrong approach. A more efficient approach would simply be to walk through the string looking for a digit, and then advance 4 characters and see if they are all digits. If they are, you've got your substring. It's not as robust, no, but it doesn't have the overhead either.
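The walk described above could be sketched roughly as follows. The names FiveDigitFinder and FindFirst are illustrative assumptions, and only ASCII digits are treated as digits. Like \d{5}, this returns the first five digits of a longer run rather than requiring exactly five.

```csharp
using System;

public static class FiveDigitFinder
{
    private static bool IsAsciiDigit(char c)
    {
        return c >= '0' && c <= '9';
    }

    // Returns the first five consecutive ASCII digits in s, or null if none.
    public static string FindFirst(string s)
    {
        for (int i = 0; i + 5 <= s.Length; i++)
        {
            if (!IsAsciiDigit(s[i]))
                continue;
            // Found a digit: count how many of the next four are digits too.
            int j = 1;
            while (j < 5 && IsAsciiDigit(s[i + j]))
                j++;
            if (j == 5)
                return s.Substring(i, 5);
            // s[i + j] is a non-digit, so no match can start at or before it.
            i += j;
        }
        return null;
    }
}
```

For example, FindFirst("asdfkki12345afdkjsdl") returns "12345", and FindFirst("abc1234") returns null.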

Don't use a regular expression at all. It's way more powerful than you need - and that power is likely to hit performance.
If you can give more details of what you need it to do, we can write the appropriate code... (Test cases would be ideal.)

If the numbers exist with other characters regular expressions are a good solution.
EG: ([0-9]{5})
will match - asdfkki12345afdkjsdl, 12345adfaksk, or akdkfa12345

If you have a simple test case like "12345" or even "12345abcd", don't use regex at all. Regexes are not known for their speed.

For most strings a brute force method is going to be quicker than a RegEx.
A fairly noddy example would be:
string strIWantNumFrom = "qweqwe23qeeq3eqqew9qwer0q";
int num = int.Parse(
string.Join( null, (
from c in strIWantNumFrom.ToCharArray()
where c == '1' || c == '2' || c == '3' || c == '4' || c == '5' ||
c == '6' || c == '7' || c == '8' || c == '9' || c == '0'
select c.ToString()
).ToArray() ) );
No doubt there are much quicker ways, and lots of optimisations that depend on the exact format of your string.

This might be faster...
public static string DigitsOnly(string inVal)
{
    char[] newPhon = new char[inVal.Length];
    int i = 0;
    foreach (char c in inVal)
        if (c >= '0' && c <= '9') // inclusive, so '0' and '9' are kept
            newPhon[i++] = c;
    // Build the string from the filled part of the buffer only.
    return new string(newPhon, 0, i);
}
If you want to limit it to at most five digits, then
public static string DigitsOnly(string inVal)
{
    char[] newPhon = new char[inVal.Length];
    int i = 0;
    foreach (char c in inVal)
        if (c >= '0' && c <= '9' && i < 5)
            newPhon[i++] = c;
    return new string(newPhon, 0, i);
}
