translate special character in strings - c#

I have a program that reads from an XML document. In this XML document, some of the attributes contain special character sequences like "\n", "\t", etc.
Is there an easy way to replace all of these sequences with the actual characters, or do I have to do it manually for each one, as in the following example?
Manual example:
s.Replace("\\n", "\n").Replace("\\t", "\t")...
edit:
I'm looking for some way to treat the string as an escaped string, like this (even though I know this doesn't work):
s.Replace("\\", "\");

Try Regex.Unescape().
Official docs here:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.unescape(v=vs.110).aspx
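As a minimal sketch of how this behaves (the class and method names are made up for this example): Regex.Unescape turns the two-character sequences \n and \t into the real characters. Keep in mind that it is designed for regex escapes, so it also processes sequences such as \x41, and to my understanding it throws an ArgumentException on escapes it does not recognize, so it may not suit arbitrary input.

```csharp
using System.Text.RegularExpressions;

public static class UnescapeDemo
{
    // Hypothetical wrapper, named for this example only.
    // Converts escape sequences such as \n and \t into the actual characters.
    public static string Translate(string attributeValue)
    {
        return Regex.Unescape(attributeValue);
    }
}
```

For example, UnescapeDemo.Translate(@"line1\nline2\tend") returns a string containing a real line feed and a real tab.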

Why not just walk the string and build up the new string in one pass? That saves a lot of duplicate searching and intermediate allocations.
string ConvertSpecialCharacters(string input) {
    var builder = new StringBuilder(input.Length);
    bool inEscape = false; // true when the previous character was a backslash
    for (int i = 0; i < input.Length; i++) {
        if (inEscape) {
            switch (input[i]) {
                case 'n':
                    builder.Append('\n');
                    break;
                case 't':
                    builder.Append('\t');
                    break;
                default:
                    // Unknown escape: keep the backslash and the character as they are
                    builder.Append('\\');
                    builder.Append(input[i]);
                    break;
            }
            inEscape = false;
        }
        else if (input[i] == '\\' && i + 1 < input.Length) {
            // A trailing backslash falls through to the else branch and is copied literally
            inEscape = true;
        }
        else {
            builder.Append(input[i]);
        }
    }
    return builder.ToString();
}


Trimming a function from a file so that the only remaining characters are the function name and parameters

I am writing some code that replaces an old C exe. The original c file would read a file and then trim the contents and put them into two new files, a .c and a .h file. I am doing the same thing, but in C#. I have everything figured out, except for how to trim a function down so that only the function name and parameters are put into the .h file.
This is an example of two of the functions:
void
M_SCP_Msg_ClearNVMemory(
Marshal_dataFunc* _argDataFunc_, Marshal_dataFuncArg _argDataFuncArg_, void const* _argSrc_)
{
SCP_Msg_ClearNVMemory const* _src_ = (SCP_Msg_ClearNVMemory const*)_argSrc_;
M_uint8_t(_argDataFunc_, _argDataFuncArg_, &_src_->operation);
}
void
MA_SCP_Msg_ClearNVMemory(
Marshal_dataFunc* argDataFunc, Marshal_dataFuncArg argDataFuncArg,
void const* argSrc, unsigned argNSrcElem)
{
SCP_Msg_ClearNVMemory const* src = (SCP_Msg_ClearNVMemory const*)argSrc;
for (; argNSrcElem > 0; --argNSrcElem)
{
M_SCP_Msg_ClearNVMemory(argDataFunc, argDataFuncArg, src++);
}
}
This would be the expected output:
extern void M_SCP_Msg_ClearNVMemory(
Marshal_dataFunc* argDataFunc, Marshal_dataFuncArg argDataFuncArg, void const* argSrc);
extern void MA_SCP_Msg_ClearNVMemory(
Marshal_dataFunc* argDataFunc, Marshal_dataFuncArg argDataFuncArg,
void const* argSrc, unsigned argNSrcElem);
Currently, the lines of the original file are read in as strings which are assigned through a streamreader and then that string is later written to a streamwriter, so I thought iterating through and finding any strings containing any functions would be a good place to start, and once I have those strings I could edit them somehow. This is what I have so far, finList being the list of strings and fin being the string I will write to the output file.
List<string> finList = new List<string>();
finList.AddRange(fin.Split('\n'));
for (int x = 0; x < finList.Count; x++)
{
if (finList[x] == "void" || finList[x] == "_Bool" || finList[x] == "bool" || finList[x] == "unsigned")
{
finList[x] = im not sure what to do here
fin = string.Empty;
}
}
for (int x = 0; x < finList.Count; x++)
{
fin += finList[x];
}
Any direction or help would be much appreciated. I am relatively new to C# and C, so please be patient if I am not using the correct terms for anything. I think ending the string/line of the function at the ")" is what makes the most sense, but I am unsure how to do this.
Thanks in advance!
Quick and dirty solution would be something like this:
int bracketLevel = 0;
int squareBracketLevel = 0;
var methods = new List<string>();
var isMethodMode = true; // track if we are in method definition or in method body
var isMethod = false; // if we have seen parenthesis in definition
var builder = new StringBuilder();
for (int i = 0; i < fin.Length; i++)
{
if (isMethodMode)
{
switch (fin[i])
{
case '(':
isMethod = true;
builder.Append(fin[i]);
bracketLevel++;
break;
case ')':
builder.Append(fin[i]);
bracketLevel--;
break;
case '{':
if (bracketLevel > 0) continue;
if (isMethod)
{
methods.Add(builder.ToString().Trim());
builder.Clear();
isMethod = false;
}
isMethodMode = false;
squareBracketLevel++;
break;
default:
builder.Append(fin[i]);
break;
}
}
else
{
switch (fin[i])
{
case '{':
squareBracketLevel++;
break;
case '}':
squareBracketLevel--;
if (squareBracketLevel == 0)
{
isMethodMode = true;
}
break;
}
}
}
The variable fin contains the loaded C file. While this works for your example, it relies on a couple of assumptions:
The C code is valid (no mismatched parentheses)
There are no comments containing parentheses (this includes commented-out functions, as was already noted in the comments)
The function bodies do not contain curly braces in string constants
If those assumptions do not hold for you, you will have to look into parser generators, which parse the C file and generate an abstract syntax tree from which you can extract the desired information. One example is ANTLR; a C grammar for it is also available as C.g4.
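The extracted entries still lack the "extern" prefix and the trailing semicolon shown in the expected output. A minimal sketch of that last step, assuming each entry looks like the text collected in methods above (HeaderWriter and ToDeclaration are hypothetical names, not part of the answer's code):

```csharp
using System.Text.RegularExpressions;

public static class HeaderWriter
{
    // Turns an extracted signature such as "void\nM_Foo(\nint a)"
    // into a header declaration such as "extern void M_Foo(\nint a);".
    public static string ToDeclaration(string signature)
    {
        int paren = signature.IndexOf('(');
        if (paren < 0)
        {
            return signature; // not a function signature, leave untouched
        }
        // Collapse the line break between return type and function name
        string head = Regex.Replace(signature.Substring(0, paren).Trim(), @"\s+", " ");
        // Keep the parameter list exactly as it was written
        string parameters = signature.Substring(paren);
        return "extern " + head + parameters + ";";
    }
}
```

Each string in methods could then be passed through HeaderWriter.ToDeclaration before it is written to the .h file.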

Better regular expression for ReverseStringFormat

For a while I've been using this neat function found here on SO:
private List<string> ReverseStringFormat(string template, string str)
{
string pattern = "^" + Regex.Replace(template, @"\{[0-9]+\}", "(.*?)") + "$";
Regex r = new Regex(pattern);
Match m = r.Match(str);
List<string> ret = new List<string>();
for (int i = 1; i < m.Groups.Count; i++)
ret.Add(m.Groups[i].Value);
return ret;
}
This function is able to process correctly templates like:
My name is {0} and I'm {1} years old
While it fails with patterns like:
My name is {0} and I'm {1:00} years old
I would like to handle this failing scenario and add fixed length parsing.
The function transforms the (first) template as follows:
My name is (.*?) and I'm (.*?) years old
I've been trying to write the above regular expression to limit the number of characters captured for the second group without success. This is my (terrible) attempt:
My name is (.*?) and I'm (.{2}) years old
I've been trying to process inputs like the following but the below PATTERN doesn't work:
PATTERN: My name is (.*?) (.{3})(.{5})
INPUT: My name is John 123ABCDE
EXPECTED OUTPUT: John, 123, ABCDE
Every suggestion is highly appreciated
It is highly unlikely that you will be able to measure the length of a captured group within the same Regex replacement.
I would strongly suggest you look at the following state machine implementation.
Please note that this implementation also solves the multiple curly brace escape feature of string.Format.
First you will need a state enum, very much like this one:
public enum State {
Outside,
OutsideAfterCurly,
Inside,
InsideAfterColon
}
Then you will need a nice way to iterate over each character in a string.
The string chars parameter represents your template parameter while the returning IEnumerable<string> represents consecutive parts of the resulting pattern:
public static IEnumerable<string> InnerTransmogrify(string chars) {
    State state = State.Outside;
    int counter = 0; // number of format characters seen after the colon
    foreach (var @char in chars) {
        switch (state) {
            case State.Outside:
                switch (@char) {
                    case '{':
                        state = State.OutsideAfterCurly;
                        break;
                    default:
                        yield return @char.ToString();
                        break;
                }
                break;
            case State.OutsideAfterCurly:
                switch (@char) {
                    case '{':
                        // "{{" is an escaped brace: match a literal '{'
                        state = State.Outside;
                        yield return @"\{";
                        break;
                    default:
                        state = State.Inside;
                        counter = 0;
                        yield return "(.";
                        break;
                }
                break;
            case State.Inside:
                switch (@char) {
                    case '}':
                        // Plain placeholder such as {0}: lazy, variable-length group
                        state = State.Outside;
                        yield return "*?)";
                        break;
                    case ':':
                        state = State.InsideAfterColon;
                        break;
                    default:
                        break;
                }
                break;
            case State.InsideAfterColon:
                switch (@char) {
                    case '}':
                        // Placeholder such as {1:00}: fixed-length group
                        state = State.Outside;
                        yield return "{" + counter + "})";
                        break;
                    default:
                        counter++;
                        break;
                }
                break;
        }
    }
}
You could join the parts like so:
public static string Transmogrify(string chars) {
var parts = InnerTransmogrify(chars);
var result = string.Join("", parts);
return result;
}
And then wrap everything up, like you originally intended:
private List<string> ReverseStringFormat(string template, string str) {
string pattern = "^" + <<SOME_PLACE>> .Transmogrify(template) + "$";
Regex r = new Regex(pattern);
Match m = r.Match(str);
List<string> ret = new List<string>();
for (int i = 1; i < m.Groups.Count; i++)
ret.Add(m.Groups[i].Value);
return ret;
}
Hope you understand why the Regex language isn't expressive enough (at least as far as my understanding is concerned) for this sort of job.
The only way to solve your problem with regular expressions is to use a custom match evaluator that sets the capture length of each group.
The code below does this for your example:
private static string PatternFromStringFormat(string template)
{
    // Replaces only elements like {0}
    string firstPass = Regex.Replace(template, @"\{[0-9]+\}", "(.*?)");
    // Replaces elements like {0:000} using a custom match evaluator
    string secondPass = Regex.Replace(firstPass, @"\{[0-9]+\:(?<len>[0-9]+)\}",
        (match) =>
        {
            // The number of format digits becomes the fixed capture length
            var len = match.Groups["len"].Value.Length;
            return "(.{" + len + "})";
        });
    return "^" + secondPass + "$";
}
private static List<string> ReverseStringFormat(string template, string str)
{
string pattern = PatternFromStringFormat(template);
Regex r = new Regex(pattern);
Match m = r.Match(str);
List<string> ret = new List<string>();
for (int i = 1; i < m.Groups.Count; i++)
ret.Add(m.Groups[i].Value);
return ret;
}
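To see the fixed-length idea end to end, here is a minimal self-contained sketch that handles both placeholder kinds in a single pass. ReverseFormatSketch, ToPattern and Parse are hypothetical names. It assumes that a format such as {1:00} always produces exactly as many characters as the format string has, which does not hold when the value is wider than the format (123 formatted with "00" gives "123"). Like the original function, it does not escape regex metacharacters in the literal parts of the template.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ReverseFormatSketch
{
    // {n} becomes a lazy group, {n:fmt} becomes a group of exactly fmt.Length characters.
    public static string ToPattern(string template)
    {
        string body = Regex.Replace(
            template,
            @"\{[0-9]+(?::(?<fmt>[^}]+))?\}",
            m => m.Groups["fmt"].Success
                ? "(.{" + m.Groups["fmt"].Value.Length + "})"
                : "(.*?)");
        return "^" + body + "$";
    }

    // Returns the captured values in placeholder order, or an empty list if nothing matches.
    public static List<string> Parse(string template, string str)
    {
        var result = new List<string>();
        Match m = Regex.Match(str, ToPattern(template));
        if (!m.Success)
        {
            return result;
        }
        for (int i = 1; i < m.Groups.Count; i++)
        {
            result.Add(m.Groups[i].Value);
        }
        return result;
    }
}
```

With the template "My name is {0} {1:000}{2:00000}" and the input "My name is John 123ABCDE", Parse yields John, 123 and ABCDE.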

How can I encode Azure storage table row keys and partition keys?

I'm using Azure storage tables and I have data going in to the RowKey that has slashes in it. According to this MSDN page, the following characters are disallowed in both the PartitionKey and RowKey:
The forward slash (/) character
The backslash (\) character
The number sign (#) character
The question mark (?) character
Control characters from U+0000 to U+001F, including:
The horizontal tab (\t) character
The linefeed (\n) character
The carriage return (\r) character
Control characters from U+007F to U+009F
I've seen some people use URL encoding to get around this. Unfortunately there's a few glitches that can arise from this, such as being able to insert but unable to delete certain entities. I've also seen some people use base64 encoding, however this also can contain disallowed characters.
How can I encode my RowKey efficiently without running in to disallowed characters, or rolling my own encoding?
Updated 18-Aug-2020 for (new?) issue with '+' character in Azure Search. See comments from @mladenb below for background. Of note, the documentation page referenced does not exclude the '+' character.
When a URL is Base64 encoded, the only character that is invalid in an Azure Table Storage key column is the forward slash ('/'). To address this, simply replace the forward slash character with another character that is both (1) valid in an Azure Table Storage key column and (2) not a Base64 character. The most common example I have found (which is cited in other answers) is to replace the forward slash ('/') with the underscore ('_').
private static String EncodeUrlInKey(String url)
{
var keyBytes = System.Text.Encoding.UTF8.GetBytes(url);
var base64 = System.Convert.ToBase64String(keyBytes);
return base64.Replace('/','_').Replace('+','-');
}
When decoding, simply undo the replaced character (first!) and then Base64 decode the resulting string. That's all there is to it.
private static String DecodeUrlInKey(String encodedKey)
{
var base64 = encodedKey.Replace('-','+').Replace('_', '/');
byte[] bytes = System.Convert.FromBase64String(base64);
return System.Text.Encoding.UTF8.GetString(bytes);
}
Some people have suggested that other Base64 characters also need encoding. According to the Azure Table Storage docs this is not the case.
I ran into the same need.
I wasn't satisfied with Base64 encoding because it turns a human-readable string into an unrecognizable string, and will inflate the size of strings regardless of whether they follow the rules (a loss when the great majority of characters are not illegal characters that need to be escaped).
Here's a coder/decoder using '!' as an escape character in much the same way one would traditionally use the backslash character.
public static class TableKeyEncoding
{
// https://msdn.microsoft.com/library/azure/dd179338.aspx
//
// The following characters are not allowed in values for the PartitionKey and RowKey properties:
// The forward slash(/) character
// The backslash(\) character
// The number sign(#) character
// The question mark (?) character
// Control characters from U+0000 to U+001F, including:
// The horizontal tab(\t) character
// The linefeed(\n) character
// The carriage return (\r) character
// Control characters from U+007F to U+009F
public static string Encode(string unsafeForUseAsAKey)
{
StringBuilder safe = new StringBuilder();
foreach (char c in unsafeForUseAsAKey)
{
switch (c)
{
case '/':
safe.Append("!f");
break;
case '\\':
safe.Append("!b");
break;
case '#':
safe.Append("!p");
break;
case '?':
safe.Append("!q");
break;
case '\t':
safe.Append("!t");
break;
case '\n':
safe.Append("!n");
break;
case '\r':
safe.Append("!r");
break;
case '!':
safe.Append("!!");
break;
default:
if (c <= 0x1f || (c >= 0x7f && c <= 0x9f))
{
int charCode = c;
safe.Append("!x" + charCode.ToString("x2"));
}
else
{
safe.Append(c);
}
break;
}
}
return safe.ToString();
}
public static string Decode(string key)
{
StringBuilder decoded = new StringBuilder();
int i = 0;
while (i < key.Length)
{
char c = key[i++];
if (c != '!' || i == key.Length)
{
// There's no escape character ('!'), or the escape should be ignored because it's the end of the array
decoded.Append(c);
}
else
{
char escapeCode = key[i++];
switch (escapeCode)
{
case 'f':
decoded.Append('/');
break;
case 'b':
decoded.Append('\\');
break;
case 'p':
decoded.Append('#');
break;
case 'q':
decoded.Append('?');
break;
case 't':
decoded.Append('\t');
break;
case 'n':
decoded.Append("\n");
break;
case 'r':
decoded.Append("\r");
break;
case '!':
decoded.Append('!');
break;
case 'x':
if (i + 2 <= key.Length)
{
string charCodeString = key.Substring(i, 2);
int charCode;
if (int.TryParse(charCodeString, NumberStyles.HexNumber, NumberFormatInfo.InvariantInfo, out charCode))
{
decoded.Append((char)charCode);
}
i += 2;
}
break;
default:
decoded.Append('!');
break;
}
}
}
return decoded.ToString();
}
}
Since one should use extreme caution when writing your own encoder, I have written some unit tests for it as well.
using Xunit;
namespace xUnit_Tests
{
public class TableKeyEncodingTests
{
const char Unicode0X1A = (char) 0x1a;
public void RoundTripTest(string unencoded, string encoded)
{
Assert.Equal(encoded, TableKeyEncoding.Encode(unencoded));
Assert.Equal(unencoded, TableKeyEncoding.Decode(encoded));
}
[Fact]
public void RoundTrips()
{
RoundTripTest("!\n", "!!!n");
RoundTripTest("left" + Unicode0X1A + "right", "left!x1aright");
}
// The following characters are not allowed in values for the PartitionKey and RowKey properties:
// The forward slash(/) character
// The backslash(\) character
// The number sign(#) character
// The question mark (?) character
// Control characters from U+0000 to U+001F, including:
// The horizontal tab(\t) character
// The linefeed(\n) character
// The carriage return (\r) character
// Control characters from U+007F to U+009F
[Fact]
void EncodesAllForbiddenCharacters()
{
List<char> forbiddenCharacters = "\\/#?\t\n\r".ToCharArray().ToList();
forbiddenCharacters.AddRange(Enumerable.Range(0x00, 1+(0x1f-0x00)).Select(i => (char)i));
forbiddenCharacters.AddRange(Enumerable.Range(0x7f, 1+(0x9f-0x7f)).Select(i => (char)i));
string allForbiddenCharacters = String.Join("", forbiddenCharacters);
string allForbiddenCharactersEncoded = TableKeyEncoding.Encode(allForbiddenCharacters);
// Make sure decoding is same as encoding
Assert.Equal(allForbiddenCharacters, TableKeyEncoding.Decode(allForbiddenCharactersEncoded));
// Ensure encoding does not contain any forbidden characters
Assert.Equal(0, allForbiddenCharacters.Count( c => allForbiddenCharactersEncoded.Contains(c) ));
}
}
}
How about the URL encode/decode functions? They take care of the '/', '?' and '#' characters.
string url = "http://www.google.com/search?q=Example";
string key = HttpUtility.UrlEncode(url);
string urlBack = HttpUtility.UrlDecode(key);
see these links
https://www.rfc-editor.org/rfc/rfc4648#page-7
Code for decoding/encoding a modified base64 URL (see also second answer: https://stackoverflow.com/a/1789179/1094268)
I had the problem myself. These are my own functions I use for this now. I use the trick in the second answer I mentioned, as well as changing up the + and / which are incompatible with azure keys that may still appear.
private static String EncodeSafeBase64(String toEncode)
{
if (toEncode == null)
throw new ArgumentNullException("toEncode");
String base64String = Convert.ToBase64String(Encoding.UTF8.GetBytes(toEncode));
StringBuilder safe = new StringBuilder();
foreach (Char c in base64String)
{
switch (c)
{
case '+':
safe.Append('-');
break;
case '/':
safe.Append('_');
break;
default:
safe.Append(c);
break;
}
}
return safe.ToString();
}
private static String DecodeSafeBase64(String toDecode)
{
if (toDecode == null)
throw new ArgumentNullException("toDecode");
StringBuilder deSafe = new StringBuilder();
foreach (Char c in toDecode)
{
switch (c)
{
case '-':
deSafe.Append('+');
break;
case '_':
deSafe.Append('/');
break;
default:
deSafe.Append(c);
break;
}
}
return Encoding.UTF8.GetString(Convert.FromBase64String(deSafe.ToString()));
}
If it is just the slashes, you can simply replace them on writing to the table with another character, say, '|' and re-replace them on reading.
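A minimal sketch of that swap, assuming that '|' never occurs in the original values (if it does, reading back would turn it into a slash and the round trip breaks). SlashSwap, ToKey and FromKey are hypothetical names:

```csharp
public static class SlashSwap
{
    // Replace the disallowed forward slash before writing the key to the table.
    public static string ToKey(string raw)
    {
        return raw.Replace('/', '|');
    }

    // Restore the forward slash after reading the key back.
    public static string FromKey(string key)
    {
        return key.Replace('|', '/');
    }
}
```

This only covers the forward slash; the other disallowed characters would still need separate handling.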
What I have seen is that although a lot of non-alphanumeric characters are technically allowed, they don't really work very well in partition and row keys.
I looked at the answers already given here and in other places and wrote this:
https://github.com/JohanNorberg/AlphaNumeric
Two alpha-numeric encoders.
If you need to escape a string that is mostly alphanumeric you can use this:
AlphaNumeric.English.Encode(str);
If you need to escape a string that is mostly not alphanumeric you can use this:
AlphaNumeric.Data.EncodeString(str);
Encoding data:
var base64 = Convert.ToBase64String(bytes);
var alphaNumericEncodedString = base64
.Replace("0", "01")
.Replace("+", "02")
.Replace("/", "03")
.Replace("=", "04");
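The replacement chain above uses '0' as an escape character, so decoding has to read every '0' together with the character that follows it. The sketch below shows one way the inverse could look. It is my own illustration of the scheme, not code taken from the linked library, and AlphaNumericSketch is a hypothetical name:

```csharp
using System;
using System.Text;

public static class AlphaNumericSketch
{
    // Same replacement chain as shown above: '0' must be escaped first.
    public static string Encode(byte[] bytes)
    {
        return Convert.ToBase64String(bytes)
            .Replace("0", "01")
            .Replace("+", "02")
            .Replace("/", "03")
            .Replace("=", "04");
    }

    // Assumed inverse: every '0' starts a two-character escape.
    public static byte[] Decode(string encoded)
    {
        var base64 = new StringBuilder(encoded.Length);
        for (int i = 0; i < encoded.Length; i++)
        {
            char c = encoded[i];
            if (c != '0')
            {
                base64.Append(c);
                continue;
            }
            if (i + 1 >= encoded.Length)
            {
                throw new FormatException("Dangling escape at end of input.");
            }
            switch (encoded[++i])
            {
                case '1': base64.Append('0'); break;
                case '2': base64.Append('+'); break;
                case '3': base64.Append('/'); break;
                case '4': base64.Append('='); break;
                default: throw new FormatException("Unknown escape code.");
            }
        }
        return Convert.FromBase64String(base64.ToString());
    }
}
```

For the bytes 0xFB 0xFF, the Base64 text "+/8=" becomes "0203804", which contains only letters and digits.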
But if you want to use, for example, an email address as a row key, you would only want to escape the '@' and '.'. This code will do that:
char[] validChars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ3456789".ToCharArray();
char[] allChars = rawString.ToCharArray();
StringBuilder builder = new StringBuilder(rawString.Length * 2);
for(int i = 0; i < allChars.Length; i++)
{
int c = allChars[i];
if((c >= 51 && c <= 57) || (c >= 65 && c <= 90) || (c >= 97 && c <= 122))
{
builder.Append(allChars[i]);
}
else
{
int index = builder.Length;
int count = 0;
do
{
builder.Append(validChars[c % 59]);
c /= 59;
count++;
} while (c > 0);
if (count == 1) builder.Insert(index, '0');
else if (count == 2) builder.Insert(index, '1');
else if (count == 3) builder.Insert(index, '2');
else throw new Exception("Base59 has invalid count, method must be wrong Count is: " + count);
}
}
return builder.ToString();

convert or figure formula which is contained parentheses

I need to find a way to convert (expand) a formula that uses only digits, letters and parentheses.
For example, for the input '5(2(a)sz)' the output should be 'aaszaaszaaszaaszaasz'.
I tried it this way:
string AddChainDeleteBracks(int open, int close, string input)
{
string to="",from="";
//get the local chain multipule the number in input[open-1]
//the number of the times the chain should be multiplied
for (int i = input[open - 1]; i > 0; i--)
{
//the content
for (int m = open + 1; m < close; m++)
{
to = to + input[m];
}
}
//get the chain i want to replace with "to"
for (int j = open - 1; j <= close; j++)
{
from = from + input[j];
}
String output = input.Replace(from, to);
return output;
}
But it doesn't work. Do you have a better idea for how to solve this?
You could store the position of each opening parenthesis, along with the number associated with it, in a stack (last in, first out, e.g. System.Collections.Generic.Stack). When you encounter the next closing parenthesis, pop the top of the stack: this gives you the start and end positions of the substring between the (so far innermost) parentheses that you need to repeat. Then replace this portion of the original string (including the repetition number) with the repeated string. Continue until you reach the end of the string.
Things to be aware of:
when you do the replacement, you need to update your current position so that it points to the end of the repeated string in the new (modified) string
depending on whether 0 repetitions are allowed, you might need to handle an empty repetition, that is, an empty string
when you reach the end of the string, the stack should be empty (all opening parentheses were matched with a closing one)
the stack might become empty in the middle of the string; if you then encounter a closing parenthesis, the input string was malformed
there might be a way to escape the opening/closing parentheses so they don't count as part of the repetition pattern; this depends on your requirements
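The stack approach described above could be sketched as follows. This is one possible reading, not a definitive implementation. It assumes that digits only ever appear as repeat counts directly in front of an opening parenthesis, that a missing count means 1, and that no escaping is needed. RepeatExpander and Expand are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class RepeatExpander
{
    public static string Expand(string input)
    {
        var sb = new StringBuilder(input);
        // Each entry: index where the repeat count starts, and index of the '('
        var open = new Stack<(int countStart, int paren)>();
        int i = 0;
        while (i < sb.Length)
        {
            char c = sb[i];
            if (c == '(')
            {
                // Walk back over the digits that form the repeat count
                int start = i;
                while (start > 0 && char.IsDigit(sb[start - 1])) start--;
                open.Push((start, i));
                i++;
            }
            else if (c == ')')
            {
                if (open.Count == 0)
                    throw new FormatException("Unmatched ')' at position " + i);
                var (countStart, paren) = open.Pop();
                int count = countStart < paren
                    ? int.Parse(sb.ToString(countStart, paren - countStart))
                    : 1;
                string inner = sb.ToString(paren + 1, i - paren - 1);
                var repeated = new StringBuilder(inner.Length * count);
                for (int k = 0; k < count; k++) repeated.Append(inner);
                // Replace "count(inner)" with the repeated text
                sb.Remove(countStart, i - countStart + 1);
                sb.Insert(countStart, repeated.ToString());
                // Continue right after the inserted text
                i = countStart + repeated.Length;
            }
            else
            {
                i++;
            }
        }
        if (open.Count != 0)
            throw new FormatException("Unmatched '('");
        return sb.ToString();
    }
}
```

For the input from the question, RepeatExpander.Expand("5(2(a)sz)") first rewrites the inner group to get "5(aasz)" and then expands the outer group.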
Since the syntax of your expression is recursive, I suggest a recursive approach.
First split the expression into single tokens. I use Regex to do it and remove empty entries.
Example: "5(2(a)sz)" is split into "5", "(", "2", "(", "a", ")", "sz", ")"
Using an Enumerator enables you to get the tokens one by one. tokens.MoveNext() gets the next token. tokens.Current is the current token.
public string ConvertExpression(string expression)
{
IEnumerator<string> tokens = Regex.Split(expression, @"\b")
.Where(s => s != "")
.GetEnumerator();
if (tokens.MoveNext()) {
return Parse(tokens);
}
return "";
}
Here the main job is done in a recursive way
private string Parse(IEnumerator<string> tokens)
{
string s = "";
while (tokens.Current != ")") {
int n;
if (tokens.Current == "(") {
if (tokens.MoveNext()) {
s += Parse(tokens);
if (tokens.Current == ")") {
tokens.MoveNext();
return s;
}
}
} else if (Int32.TryParse(tokens.Current, out n)) {
if (tokens.MoveNext()) {
string subExpr = Parse(tokens);
var sb = new StringBuilder();
for (int i = 0; i < n; i++) {
sb.Append(subExpr);
}
s += sb.ToString();
}
} else {
s += tokens.Current;
if (!tokens.MoveNext())
return s;
}
}
return s;
}
Here is my second answer. My first answer was a quick shot. Here I tried to create a parser by doing the things one by one.
In order to convert an expression, you need to parse it. This means that you have to analyze its syntax. While analyzing its syntax you can produce an output as well.
1 The first thing to do, is to define the syntax of all the valid expressions.
Here I use EBNF to do it. EBNF is simple.
{ and } enclose repetitions (possibly zero).
[ and ] enclose an optional part.
| separates alternatives.
See Extended Backus–Naur Form (EBNF) on Wikipedia for more detailed information on EBNF. (The EBNF variant used here drops the concatenation operator ",".)
Our syntax in EBNF
Expression = { Term }.
Term = [ Number ] Factor.
Factor = Text | "(" Expression ")" | Term.
Examples
5(2(a)sz) => aaszaaszaaszaaszaasz
5(2a sz) => aaszaaszaaszaaszaasz
2 3(a 2b)c => abbabbabbabbabbabbc
2 Lexical analysis
Before we analyze the syntax we have to split the whole expression into single lexical tokens (numbers, operators, etc.).
We use an enum to indicate the token type
private enum TokenType
{
None,
LPar,
RPar,
Number,
Text
}
The following fields are used to hold the token information and the Boolean _error which tells whether an error occurred during parsing.
private IEnumerator<Match> _matches;
TokenType _tokenType;
string _text;
int _number;
bool _error;
The method ConvertExpression starts the conversion. It splits the expression into single tokens represented as Regex.Matches.
Those are used by the method GetToken, which in turn converts the Regex.Matches into more useful information. This information is stored in the fields described above.
public string ConvertExpression(string expression)
{
_matches = Regex.Matches(expression, @"\d+|\(|\)|[a-zA-Z]+")
.Cast<Match>()
.GetEnumerator();
_error = false;
return GetToken() ? Expression() : "";
}
private bool GetToken()
{
_number = 0;
_tokenType = TokenType.None;
_text = null;
if (_error || !_matches.MoveNext())
return false;
_text = _matches.Current.Value;
switch (_text[0]) {
case '(':
_tokenType = TokenType.LPar;
break;
case ')':
_tokenType = TokenType.RPar;
break;
case '0':
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9':
_tokenType = TokenType.Number;
_number = Int32.Parse(_text);
break;
default:
_tokenType = TokenType.Text;
break;
}
return true;
}
3 Syntactic and Semantic Analysis
Now we have everything we need to perform the actual parsing and expression conversion. Each of the methods below analyses one EBNF syntax production and returns the result of the conversion as string.
The conversion of EBNF into C# code is straightforward. A repetition in the syntax is converted to a C# loop statement.
An option is converted to an if statement and alternatives are converted to a switch statement.
// Expression = { Term }.
private string Expression()
{
string s = "";
do {
s += Term();
} while (_tokenType != TokenType.RPar && _tokenType != TokenType.None);
return s;
}
// Term = [ Number ] Factor.
private string Term()
{
int n;
if (_tokenType == TokenType.Number) {
n = _number;
if (!GetToken()) {
_error = true;
return " Error: Factor expected.";
}
string factor = Factor();
if (_error) {
return factor;
}
var sb = new StringBuilder(n * factor.Length);
for (int i = 0; i < n; i++) {
sb.Append(factor);
}
return sb.ToString();
}
return Factor();
}
// Factor = Text | "(" Expression ")" | Term.
private string Factor()
{
switch (_tokenType) {
case TokenType.None:
_error = true;
return " Error: Unexpected end of Expression.";
case TokenType.LPar:
if (GetToken()) {
string s = Expression();
if (_tokenType == TokenType.RPar) {
GetToken();
return s;
} else {
_error = true;
return s + " Error ')' expected.";
}
} else {
_error = true;
return " Error: Unexpected end of Expression.";
}
case TokenType.RPar:
_error = true;
GetToken();
return " Error: Unexpected ')'.";
case TokenType.Text:
string t = _text;
GetToken();
return t;
default:
return Term();
}
}

What is a quick way to force CRLF in C# / .NET?

How would you normalize all new-line sequences in a string to one type?
I'm looking to make them all CRLF for the purpose of email (MIME documents). Ideally this would be wrapped in a static method, executing very quickly, and not using regular expressions (since the variances of line breaks, carriage returns, etc. are limited). Perhaps there's even a BCL method I've overlooked?
ASSUMPTION: After giving this a bit more thought, I think it's a safe assumption to say that CR's are either stand-alone or part of the CRLF sequence. That is, if you see CRLF then you know all CR's can be removed. Otherwise it's difficult to tell how many lines should come out of something like "\r\n\n\r".
input.Replace("\r\n", "\n").Replace("\r", "\n").Replace("\n", "\r\n")
This will work if the input contains only one type of line breaks - either CR, or LF, or CR+LF.
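A minimal sketch of that chain wrapped in a static method, as the question asked for (CrLf and Normalize are hypothetical names). Under the question's assumption that a CR is either stand-alone or part of a CRLF pair, mixed input appears to normalize as well, because every break is first reduced to a single LF:

```csharp
public static class CrLf
{
    // Step 1: CRLF -> LF, step 2: lone CR -> LF, step 3: every LF -> CRLF.
    public static string Normalize(string input)
    {
        return input
            .Replace("\r\n", "\n")
            .Replace("\r", "\n")
            .Replace("\n", "\r\n");
    }
}
```

Each Replace call allocates a new string, so for very large inputs a single-pass approach may be preferable.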
It depends on exactly what the requirements are. In particular, how do you want to handle "\r" on its own? Should that count as a line break or not? As an example, how should "a\n\rb" be treated? Is that one very odd line break, one "\n" break and then a rogue "\r", or two separate linebreaks? If "\r" and "\n" can both be linebreaks on their own, why should "\r\n" not be treated as two linebreaks?
Here's some code which I suspect is reasonably efficient.
using System;
using System.Text;
class LineBreaks
{
static void Main()
{
Test("a\nb");
Test("a\nb\r\nc");
Test("a\r\nb\r\nc");
Test("a\rb\nc");
Test("a\r");
Test("a\n");
Test("a\r\n");
}
static void Test(string input)
{
string normalized = NormalizeLineBreaks(input);
string debug = normalized.Replace("\r", "\\r")
.Replace("\n", "\\n");
Console.WriteLine(debug);
}
static string NormalizeLineBreaks(string input)
{
// Allow 10% as a rough guess of how much the string may grow.
// If we're wrong we'll either waste space or have extra copies -
// it will still work
StringBuilder builder = new StringBuilder((int) (input.Length * 1.1));
bool lastWasCR = false;
foreach (char c in input)
{
if (lastWasCR)
{
lastWasCR = false;
if (c == '\n')
{
continue; // Already written \r\n
}
}
switch (c)
{
case '\r':
builder.Append("\r\n");
lastWasCR = true;
break;
case '\n':
builder.Append("\r\n");
break;
default:
builder.Append(c);
break;
}
}
return builder.ToString();
}
}
Simple variant:
Regex.Replace(input, @"\r\n|\r|\n", "\r\n")
For better performance:
static Regex newline_pattern = new Regex(@"\r\n|\r|\n", RegexOptions.Compiled);
[...]
newline_pattern.Replace(input, "\r\n");
string nonNormalized = "\r\n\n\r";
string normalized = nonNormalized.Replace("\r", "\n").Replace("\n", "\r\n");
What I mean is that this is a quick way to do it.
It does not use an expensive regex function.
It also avoids chaining multiple replace calls, each of which would loop over the data separately with its own checks, allocations, etc.
The search is therefore done directly in a single for loop. Each time the capacity of the result array has to be increased, Array.Copy performs a loop internally as well. Those are all the loops.
In some cases, a larger page size might be more efficient.
public static string NormalizeNewLine(this string val)
{
if (string.IsNullOrEmpty(val))
return val;
const int page = 6;
int a = page;
int j = 0;
int len = val.Length;
char[] res = new char[len];
for (int i = 0; i < len; i++)
{
char ch = val[i];
if (ch == '\r')
{
int ni = i + 1;
if (ni < len && val[ni] == '\n')
{
res[j++] = '\r';
res[j++] = '\n';
i++;
}
else
{
if (a == page) // Ensure capacity
{
char[] nres = new char[res.Length + page];
Array.Copy(res, 0, nres, 0, res.Length);
res = nres;
a = 0;
}
res[j++] = '\r';
res[j++] = '\n';
a++;
}
}
else if (ch == '\n')
{
int ni = i + 1;
if (ni < len && val[ni] == '\r')
{
res[j++] = '\r';
res[j++] = '\n';
i++;
}
else
{
if (a == page) // Ensure capacity
{
char[] nres = new char[res.Length + page];
Array.Copy(res, 0, nres, 0, res.Length);
res = nres;
a = 0;
}
res[j++] = '\r';
res[j++] = '\n';
a++;
}
}
else
{
res[j++] = ch;
}
}
return new string(res, 0, j);
}
I know that '\n\r' is not actually used on common platforms. But who would use two types of line breaks in succession to indicate two line breaks?
If you need to handle that case, you have to check beforehand whether \n and \r are both used separately in the same document.
Environment.NewLine;
A string containing "\r\n" for non-Unix platforms, or a string containing "\n" for Unix platforms.
str.Replace("\r", "").Replace("\n", "\r\n");
Converts both types of line breaks (\n and \r\n) into CRLFs. Note that a stand-alone \r is removed rather than converted.
On .NET 6 it's 35% faster than regex (benchmarked using BenchmarkDotNet).
