Split String using delimiter that exists in the string - c#

I have a problem and I am wondering if there is any smart workaround.
I need to pass a string through a socket to a web application. This string has three parts and I use the '|' as a delimiter to split at the receiving application into the three separate parts.
The problem is that the '|' character can be a character in any of the 3 separate strings and when this occurs the whole splitting action distorts the strings.
My question therefore is this:
Is there a way to use a char/string as a delimiter in some text while this char/string itself might be in the text?

The general pattern is to escape the delimiter character. E.g. when '|' is the delimiter, you could use "||" whenever you need the character itself inside a string (might be difficult if you allow empty strings) or you could use something like '\' as the escape character so that '|' becomes "\|" and "\" itself would be "\\"
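To make the backslash variant concrete, here is a minimal sketch, assuming '\' as the escape character and '|' as the delimiter. The class name `PipeCodec` and its two methods are hypothetical names made up for this illustration, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: joins and splits fields on '|' with '\' as the escape character.
public static class PipeCodec
{
    private const char Delimiter = '|';
    private const char Escape = '\\';

    // Sender side: escape the escape character first, then the delimiter.
    public static string Join(params string[] fields)
    {
        var escaped = new string[fields.Length];
        for (int i = 0; i < fields.Length; i++)
        {
            escaped[i] = fields[i].Replace(@"\", @"\\").Replace("|", @"\|");
        }
        return string.Join("|", escaped);
    }

    // Receiver side: walk the message once. A character that follows '\' is
    // taken literally, and an unescaped '|' ends the current field.
    public static List<string> Split(string message)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        for (int i = 0; i < message.Length; i++)
        {
            char c = message[i];
            if (c == Escape && i + 1 < message.Length)
            {
                current.Append(message[++i]);
            }
            else if (c == Delimiter)
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString());
        return fields;
    }
}
```

With this scheme empty strings survive the round trip as well, which is the case that makes the "||" variant awkward.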

The problem is that the following string:
string toParse = "What|do you|want|to|say|?";
can be parsed in several ways:
"What
do you
want|to|say|?"
or
"What|do you
want
to|say|?"
and so on...
You can define rules to parse your string, but coding them will be hard, and the result will seem counterintuitive to the end user.
The string must contain an escape character that indicates that the symbol "|" is meant literally, not as the separator.
This could be for example "\|".
Here is a full example using regex:
using System;
using System.Text.RegularExpressions;
//... Put this in the main method of a Console Application for instance.
// The '@' character before a string marks it as a verbatim string, in which the backslash '\' is not treated as an escape character
Regex reg = new Regex(@"^((?<string1>([^\|]|\\\|)+)\|)((?<string2>([^\|]|\\\|)+)\|)(?<string3>([^\|]|\\\|)+)$");
string toTest = @"user\|dureuill|deserves|an\|upvote";
MatchCollection matches = reg.Matches(toTest);
if (matches.Count != 1)
{
    throw new FormatException("Badly formatted input.");
}
Match match = matches[0];
// Turn each escaped "\|" back into a plain "|"
string string1 = match.Groups["string1"].Value.Replace(@"\|", "|");
string string2 = match.Groups["string2"].Value.Replace(@"\|", "|");
string string3 = match.Groups["string3"].Value.Replace(@"\|", "|");
Console.WriteLine(string1);
Console.WriteLine(string2);
Console.WriteLine(string3);
Console.ReadKey();

Is there a way to use a char/string as a delimiter in some text while
this char/string itself might be in the text?
Simple answer: No.
That is, no, as long as the delimiter is identical to text that can occur in the data and the data is not modified in any way.
There are possible workarounds, of course. One is to require a minimum or fixed width between delimiters, although this is not perfect.
Another is to select a delimiter (a sequence of characters) that will never occur in your text. This requires you to change both the source and the consumer.
When I need delimiters I normally pick one that I am 99.9% sure will never occur in normal text; the delimiter may vary depending on the kind of text I expect.
Here's a quote from Wikipedia:
Because delimiter collision is a very common problem, various methods
for avoiding it have been invented. Some authors may attempt to avoid
the problem by choosing a delimiter character (or sequence of
characters) that is not likely to appear in the data stream itself.
This ad-hoc approach may be suitable, but it necessarily depends on a
correct guess of what will appear in the data stream, and offers no
security against malicious collisions. Other, more formal conventions
are therefore applied as well.
Just a side note to your use-case, why not use a protocol for the data that is sent? Such as protobuf?

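Maybe it would help to encode your strings before joining them with your delimiter and to decode them after splitting. Note that HTML encoding (HTMLEncode/HTMLDecode) leaves '|' untouched, so for this to work you need an encoding that does replace it, such as URL encoding via Uri.EscapeDataString, which turns '|' into "%7C".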

I think you can either
1) Find a character or set of characters that would never appear together in the string
or
2) Use fixed-length strings and pad.

Maybe adapt the delimiter if you have the flexibility to do this? So instead of String1|String2 the string could read "String1"|"String2".
If pipes are unwanted - put some simple validation in place during creation/entry of this string?

Instead of using | as delimiter, you could find a delimiter that's not present in the message parts and pass it along at the beginning of the sent message. Here's an example using an integer as delimiter:
String[] parts = {"this is a message", "it's got three parts", "this one's the last"};
String delimiter = null;
// Try 0..99 and keep the first number that appears in none of the parts
for (int i = 0; i < 100; i++) {
    String s = Integer.toString(i);
    if (parts[0].contains(s) || parts[1].contains(s) || parts[2].contains(s))
        continue;
    delimiter = s;
    break;
}
String message = delimiter + "#" + parts[0] + delimiter + parts[1] + delimiter + parts[2];
Now the message is 0#this is a message0it's got three parts0this one's the last.
On the receiving end you start by finding the delimiter and split the message string on that:
String[] tmp = message.split("#", 2);
String[] parts = tmp[1].split(tmp[0]);
It's not the most efficient possible solution, since it requires scanning the message parts several times, but it's very easy to implement. If you don't find a value for delimiter and null happens to be part of the message, you might experience unexpected results.

Related

Regex.Split command in c#

I am trying to use Regex.Split to split a string while keeping all of its contents, including the delimiters I use. The string is a math problem, for example 5+9/2*1-1. I have it working if the string contains a + sign, but I don't know how to add more than one character to the delimiter list. I have looked online at multiple pages but everything I try gives me errors. Here is the code for the Regex.Split line I have (it works for the plus; now I need it to also handle -, *, and /):
string[] everything = Regex.Split(inputBox.Text, @"(\+)");
Use a character class to match any of the math operations: [*/+-]
string input = "5+9/2*1-1";
string pattern = @"([*/+-])";
string[] result = Regex.Split(input, pattern);
Be aware that character classes allow ranges, such as [0-9], which matches any digit from 0 up to 9. Therefore, to avoid accidental ranges, you can escape the - or place it at either the beginning or end of the character class.
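As a small illustration of the answer above, here is a sketch that wraps the split in a method. The class and method names (`ExpressionTokenizer`, `Tokenize`) are made up for this example; only `Regex.Split` comes from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the Regex.Split call from the answer above.
public static class ExpressionTokenizer
{
    // The hyphen is placed last in the class, so it is a literal '-' and not a range.
    // The parentheses form a capturing group, which makes Regex.Split keep
    // the matched operators in the result instead of discarding them.
    public static string[] Tokenize(string expression)
    {
        return Regex.Split(expression, @"([*/+-])");
    }
}
```

For "5+9/2*1-1" this gives the nine strings 5, +, 9, /, 2, *, 1, -, 1. An input that starts with an operator, such as "-5+2", produces an empty string as its first element, so you may want to filter those out.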

C# Trouble with Regex.Replace

Been scratching my head all day about this one!
Ok, so I have a string which contains the following:
?\"width=\"1\"height=\"1\"border=\"0\"style=\"display:none;\">');
I want to convert that string to the following:
?\"width=1height=1border=0style=\"display:none;\">');
I could theoretically just do a String.Replace on "\"1\"" etc. But this isn't really a viable option as the string could theoretically have any number within the expression.
I also thought about removing the string "\"", however there are other occurrences of this which I don't want to be replaced.
I have been attempting to use the Regex.Replace method as I believe this exists to solve problems along my lines. Here's what I've got:
chunkContents = Regex.Replace(chunkContents, "\".\"", ".");
Now that really messes things up (it replaces the correct elements, but with a full stop), but I think you can see what I am attempting to do with it. I am also worried that this will only work for single digits (\"1\" rather than \"11\"). That led me to think about using "*" or "+" rather than just ".", but I foresaw the problem of this picking up all of the text in between the desired characters (which are dotted all over the place), whereas I only want to replace the ones with numeric characters in between them.
Hope I've explained that clearly enough, will be happy to provide any extra info if needed :)
Try this
var str = "?\"width=\"1\"height=\"1234\"border=\"0\"style=\"display:none;\">');";
str = Regex.Replace(str , "\"(\\d+)\"", "$1");
(\\d+) is a capturing group that looks for one or more digits and $1 references what the group captured.
This works
String input = @"?\""width=\""1\""height=\""1\""border=\""0\""style=\""display:none;\"">');";
//replace the entire match of the regex with only what's captured (the number)
String result = Regex.Replace(input, @"\\""(\d+)\\""", match => match.Result("$1"));
//control string for the expected result
String shouldBe = @"?\""width=1height=1border=0style=\""display:none;\"">');";
//prints True
Console.WriteLine(result.Equals(shouldBe).ToString());

How would one trim all non-alphanumeric and numeric characters from the beginning and end of a string?

EDIT: I changed the title to reflect specifically what it is I'm trying to do.
Is there a way to retrieve all alphanumeric (or preferably, just the alphabet) characters for the current culture in .NET? My scenario is that I have several strings that I need to remove all numerals and non-alphabet characters from, and I'm not quite sure how I would implement this while honoring the alphabet of languages other than English (short of creating arrays of all alphabet characters for all supported languages of .NET, or at least the languages of our current clients lol)
UPDATE:
Specifically, what I'm trying to do is trim all non-alphabet chars from the start of the string up until the first alphabet character, and then from the last alphabet character to the end of the string. So for a random example in en-US, I want to turn:
()&*1#^#47*^#21%Littering aaaannnnd(*&^1#*32%#**)7(#9&^
into the following:
Littering aaaannnnd
This would be simple enough to do for English since it's my first language, but really in any culture I need to be able to remove numerals and other non-alphanumeric characters from the string.
string something = "()&*1#^#47*^#21%Littering aaaannnndóú(*&^1#*32%#**)7(#9&^";
string somethingNew = Regex.Replace(something, @"[^\p{L}-\s]+", "");
Is this what you're looking for?
Edit: Added \p{L} to allow letters from other languages. This will output Littering aaaannnndóú
Using regex method, this should work out:
string input = "()&*1#^#47*^#21%Littering aaaannnnd(*&^1#*32%#**)7(#9&^";
string result = Regex.Replace(input, "(?:^[^a-zA-Z]*|[^a-zA-Z]*$)", ""); //TRIM FROM START & END
Without using regex:
In Java, you could do:
while (true) {
    if (word.length() == 0) {
        return ""; // bad
    }
    if (!Character.isLetter(word.charAt(0))) {
        word = word.substring(1);
        continue; // so we are doing front first
    }
    if (!Character.isLetter(word.charAt(word.length()-1))) {
        word = word.substring(0, word.length()-1);
        continue; // then we are doing end
    }
    break; // if front is done, and end is done
}
If you are using something other than Java, substituting Character.isLetter is straightforward: just search for your character encoding, find the integer values of the alphabetic characters, and test against those.
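Since the question is about .NET, here is a rough C# counterpart of the loop above. It is a sketch that relies on `char.IsLetter`, which follows Unicode letter categories rather than the current culture; `LetterTrimmer` and `TrimNonLetters` are hypothetical names.

```csharp
using System;

// Hypothetical helper: trims everything before the first letter
// and after the last letter, leaving the middle untouched.
public static class LetterTrimmer
{
    public static string TrimNonLetters(string text)
    {
        int start = 0;
        int end = text.Length - 1;

        // Move past leading characters that are not letters.
        while (start <= end && !char.IsLetter(text[start])) start++;

        // Move back past trailing characters that are not letters.
        while (end >= start && !char.IsLetter(text[end])) end--;

        // When no letter was found, start is past end and the length is 0.
        return text.Substring(start, end - start + 1);
    }
}
```

It works on individual `char` values, so letters outside the Basic Multilingual Plane (stored as surrogate pairs) are not recognized; that is an accepted limitation of this sketch.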

c#: how to quickly insert a character into string in front of any occurrences of special combinations of characters?

I have a string where I need to escape any occurrences of special combinations of characters. In other words, I need to stick a "\" in front any occurrence of any such combination. Most combinations are actually single characters (e.g a double quote or a backslash) but some are multi-character (e.g. "&&"). One approach is to create an array of strings with these combinations, loop over them and run a String.Replace(), with the backslash being checked the last to avoid recursive escaping. But is there a better (more elegant/quick/etc) way of doing it? Thx
Use your idea of Replace, but with a StringBuilder instead (much better performance).
You can use Regex.Replace for this.
var input = @"abc'def&&aa\cc""ff";
var output = Regex.Replace(input, @"'|&&|""|\\", m => @"\" + m); // => "abc\'def\&&aa\\cc\"ff"
You can just take your entire string and run String.Replace() for each replacement you want to do. As far as I know that is the quickest and most elegant way to do it. That's why it is a built-in method.
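If the list of combinations changes over time, the Regex.Replace approach can be driven from an array instead of a hand-written pattern. The following is a sketch under that assumption; `TokenEscaper` and its token list are hypothetical and only illustrate the idea.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: builds one alternation pattern from a token list
// and prefixes every match with a backslash in a single pass.
public static class TokenEscaper
{
    // Example tokens (an assumption for this sketch): "&&", single quote,
    // double quote and the backslash itself.
    private static readonly string[] Tokens = { "&&", "'", "\"", "\\" };

    // Longer tokens are listed first so that a multi-character token is
    // tried before any shorter token that could match at the same position.
    private static readonly Regex Pattern = new Regex(
        string.Join("|", Tokens.OrderByDescending(t => t.Length)
                               .Select(t => Regex.Escape(t))));

    public static string Escape(string input)
    {
        return Pattern.Replace(input, m => @"\" + m.Value);
    }
}
```

Because the input is scanned only once, the backslashes that get inserted are never escaped again, so the ordering trick needed with repeated String.Replace calls does not apply here.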

Remove all "invisible" chars from a string?

I'm writing a little class to read a list of key value pairs from a file and write to a Dictionary<string, string>. This file will have this format:
key1:value1
key2:value2
key3:value3
...
This should be pretty easy to do, but since a user is going to edit this file manually, how should I deal with whitespaces, tabs, extra line jumps and stuff like that? I can probably use Replace to remove whitespaces and tabs, but, is there any other "invisible" characters I'm missing?
Or maybe I can remove all characters that are not alphanumeric, ":" and line jumps (since line jumps are what separate one pair from another), and then remove all extra line jumps. If this, I don't know how to remove "all-except-some" characters.
Of course I can also check for errors like "key1:value1:somethingelse". But stuff like that doesn't really matter much because it's obviously the user's fault and I would just show a "Invalid format" message. I just want to deal with the basic stuff and then put all that in a try/catch block just in case anything else goes wrong.
Note: I do NOT need any whitespaces at all, even inside a key or a value.
I did this one recently when I finally got pissed off at too much undocumented garbage coming through in a feed and producing bad XML. It effectively strips anything that doesn't fall between a space and the ~ in the ASCII table:
public static string StripControlChars(this string s)
{
    // \x20 is the space and \x7E is '~'; \x7F (DEL) is a control character, so it is stripped too
    return Regex.Replace(s, @"[^\x20-\x7E]", "");
}
Combined with the other RegEx examples already posted it should get you where you want to go.
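If you use Regex (regular expressions) you can filter out all of that with one function call.
string newVariable = Regex.Replace(variable, @"\s", "");
That will remove all whitespace characters, including spaces, tabs, \n, and \r. Invisible characters that are not classed as whitespace (a zero-width space, for example) are not matched by \s.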
One of the "white" spaces that regularly bites us is the non-breakable space. Also our system must be compatible with MS-Dynamics which is much more restrictive. First, I created a function that maps the 8th bit characters to their approximate 7th bit counterpart, then I removed anything that was not in the x20 to x7f range further limited by the Dynamics interface.
Regex.Replace(s, @"[^\x20-\x7F]", "")
should do that job.
The requirements are too fuzzy. Consider:
"When is a space a value? key?"
"When is a delimiter a value? key?"
"When is a tab a value? key?"
"Where does a value end when a delimiter is used in the context of a value? key"?
These problems will result in code filled with one-offs and a poor user experience. This is why we have language rules/grammar.
Define a simple grammar and take out most of the guesswork.
"{key}":"{value}",
Here you have a key/value pair contained within quotes and separated via a delimiter (,). All extraneous characters can be ignored. You could use XML, but this may scare off less techy users.
Note, the quotes are arbitrary. Feel free to replace with any set container that will not need much escaping (just beware the complexity).
Personally, I would wrap this up in a simple UI and serialize the data out as XML. There are times not to do this, but you have given me no reason not to.
var split = textLine.Split(':').Select(s => s.Trim()).ToArray();
The Trim() function will remove all the irrelevant whitespace. Note that this retains whitespace inside of a key or value, which you may want to consider separately.
You can use string.Trim() to remove white-space characters:
var results = lines
    .Select(line => {
        var pair = line.Split(new[] {':'}, 2);
        return new {
            Key = pair[0].Trim(),
            Value = pair[1].Trim(),
        };
    }).ToList();
However, if you want to remove all white-spaces, you can use regular expressions:
var whiteSpaceRegex = new Regex(@"\s+", RegexOptions.Compiled);
var results = lines
    .Select(line => {
        var pair = line.Split(new[] {':'}, 2);
        return new {
            Key = whiteSpaceRegex.Replace(pair[0], string.Empty),
            Value = whiteSpaceRegex.Replace(pair[1], string.Empty),
        };
    }).ToList();
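To tie this back to the Dictionary<string, string> from the question, here is a minimal sketch that combines the whitespace removal with some basic validation. The names `KeyValueParser` and `Parse` are hypothetical, and skipping blank lines and throwing a FormatException on bad lines are assumptions about the desired behaviour.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: turns "key:value" lines into a dictionary.
public static class KeyValueParser
{
    private static readonly Regex WhiteSpace = new Regex(@"\s+");

    public static Dictionary<string, string> Parse(IEnumerable<string> lines)
    {
        var result = new Dictionary<string, string>();
        foreach (string rawLine in lines)
        {
            // Drop every whitespace character, including those inside keys and values.
            string line = WhiteSpace.Replace(rawLine, string.Empty);
            if (line.Length == 0) continue; // skip blank lines

            // Split on the first ':' only; anything after it belongs to the value.
            string[] pair = line.Split(new[] { ':' }, 2);
            if (pair.Length != 2 || pair[0].Length == 0)
            {
                throw new FormatException("Invalid format: " + rawLine);
            }

            result[pair[0]] = pair[1]; // a repeated key overwrites the earlier value
        }
        return result;
    }
}
```

Note that a line such as "key1:value1:somethingelse" is accepted here with the value "value1:somethingelse"; if that should count as invalid, add a check for ':' in the value.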
If it doesn't have to be fast, you could use LINQ:
string clean = new String(tainted.Where(c => 0 <= "ABCDabcd1234:\r\n".IndexOf(c)).ToArray());
