I wrote the following snippet to get rid of excessive spaces in slabs of text:
int index = text.IndexOf("  ");
while (index > 0)
{
    text = text.Replace("  ", " ");
    index = text.IndexOf("  ");
}
Generally this works fine, albeit rather primitive and possibly inefficient.
Problem
When the text contains " - ", for some bizarre reason IndexOf returns a match!
The Replace function doesn't remove anything, and then it is stuck in an endless loop.
Any ideas what is going on with the string.IndexOf?
Ah, the joys of text.
What you most likely have there, though it got lost when posting on SO, is a "soft hyphen".
To reproduce the problem, I tried this code in LINQPad:
void Main()
{
    var text = "Test1 \u00ad Test2";
    int index = text.IndexOf("  ");
    while (index > 0)
    {
        text = text.Replace("  ", " ");
        index = text.IndexOf("  ");
    }
}
And sure enough, the above code just gets stuck in a loop.
Note that \u00ad is the Unicode code point for Soft Hyphen, according to CharMap. You can always copy and paste the character from CharMap as well, but posting it here on SO will replace it with its much more common cousin, the Hyphen-Minus, U+002D (the one on your keyboard).
You can read a small section in the documentation for the String Class which has this to say on the subject:
String search methods, such as String.StartsWith and String.IndexOf, also can perform culture-sensitive or ordinal string comparisons. The following example illustrates the differences between ordinal and culture-sensitive comparisons using the IndexOf method. A culture-sensitive search in which the current culture is English (United States) considers the substring "oe" to match the ligature "œ". Because a soft hyphen (U+00AD) is a zero-width character, the search treats the soft hyphen as equivalent to Empty and finds a match at the beginning of the string. An ordinal search, on the other hand, does not find a match in either case.
I've highlighted the relevant part, but I also remember a blog post about this exact problem a while back but my Google-Fu is failing me tonight.
The problem here is that IndexOf and Replace use different methods for locating the text.
IndexOf treats the soft hyphen as "not really there", and thus sees the two spaces on either side of it as "two joined spaces". The Replace method does not, and thus won't remove either of them. The loop condition therefore stays true, but since Replace never removes the spaces that satisfy it, the loop never ends. Undoubtedly there are other such characters in the Unicode symbol space that exhibit similar problems, but this is the most typical case I've seen.
There are at least two ways of handling this:
You can use Regex.Replace, which seems to not have this problem:
text = Regex.Replace(text, " +", " ");
Personally I would probably use the whitespace special character in the Regular Expression, which is \s, but if you only want spaces, the above should do the trick.
You can explicitly ask IndexOf to use an ordinal comparison, which won't get tripped up by text behaving like ... well ... text:
index = text.IndexOf(" ", StringComparison.Ordinal);
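Putting the second option together, here is a minimal sketch of the original loop with the search forced to ordinal, so that IndexOf and Replace agree on what counts as two spaces. The class and method names (SpaceCollapser, CollapseSpaces) are made up for illustration, and the loop condition uses >= 0 rather than the question's > 0 so that a double space at the very start of the text is also handled.

```csharp
using System;

public static class SpaceCollapser
{
    // Hypothetical helper: collapses runs of U+0020 spaces to a single space.
    public static string CollapseSpaces(string text)
    {
        const string twoSpaces = "  ";
        const string oneSpace = " ";

        // Ordinal search, the same comparison string.Replace(string, string) uses,
        // so the loop stops as soon as Replace has nothing left to remove.
        while (text.IndexOf(twoSpaces, StringComparison.Ordinal) >= 0)
        {
            text = text.Replace(twoSpaces, oneSpace);
        }
        return text;
    }
}
```

With this version a string such as "Test1 \u00ad Test2" comes back unchanged, because ordinally the two spaces are separated by the soft hyphen and are not adjacent.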
Related
I want to replace the quotation marks in some strings. However, opening and closing marks have to be replaced with « and » respectively. It is not certain that the string will begin and end with quotation marks.
For example I have this string:
"THIS IS "inner1" THE MAIN "inner2" SENTENCE"
I want to change it to:
«THIS IS «inner1» THE MAIN «inner2» SENTENCE»
SOLUTION:
With much help from musefan (the code differs a bit from his original solution, since it is not certain that the string will begin and end with quotation marks). It does not pair up the quotation marks in any way; it replaces each one that follows or is followed by a whitespace, and then checks the first and last characters of the string and replaces them if necessary.
using System;

public class Test
{
    public static void Main()
    {
        string input = "\"THIS IS \"inner1\" THE MAIN \"inner2\" SENTENCE\"";
        string result = input;
        // Replace quotes that follow space with « and replace quotes that precede space with »
        result = result.Replace(" \"", " «").Replace("\" ", "» ");
        // if first character is " then replace with «
        if (result.Substring(0, 1) == "\"")
            result = "«" + result.Substring(1);
        // get last character of the string
        char last = result[result.Length - 1];
        // if it is " then replace it with »
        if (last.ToString() == "\"")
            result = result.Remove(result.Length - 1) + "»";
        Console.WriteLine(result);
    }
}
The main problem is: how do you know when a quote should be the start of a new set, or the end of an existing one? There are many possible use cases that might require different handling.
So, I have made the assumption that you are going to use space characters to work out if the quote is the start of a new set, or if it is the end of an existing one. The reason for this assumption is that it is the most obvious logic to ensure you get the desired result.
With that in mind, it becomes very simple:
// First remove the outer quotes, we will manually change them at the end.
string result = input.Substring(1, input.Length - 2);
// Replace quotes that follow space with « and replace quotes that precede space with »
result = result.Replace(" \"", " «").Replace("\" ", "» ");
// Add the outer chevrons around the result.
result = string.Format("«{0}»", result);
Here is a working example.
Disclaimer: Please keep in mind that this answer is provided based on the sample data you have given. There are many possible inputs where it may be required to re-think the rules/logic in order to achieve the desired result. However, I cannot cater for that without knowing those additional requirements.
Feel free to edit your question if you have more specific requirements and I will try to update my answer, however you may need to prompt me with a comment so I know you have changed your requirements.
Stackoverflow has been very generous with answers to my regex questions so far, but with this one I'm blanking on what to do and just can't seem to find it answered here.
So I'm parsing a string, let's say for example's sake, a line of VB-esque code like either of the following:
Call Function ( "Str ing 1 ", "String 2" , " String 3 ", 1000 ) As Integer
Dim x = "This string should not be affected "
I'm trying to parse the text in order to eliminate all leading spaces, trailing spaces, and extra internal spaces (when two "words/chunks" are separated by two or more spaces, or when there are one or more spaces between a character and a parenthesis) using regex in C#. The result after parsing the above should look like:
Call Function("Str ing 1 ", "String 2", " String 3 ", 1000) As Integer
Dim x = "This string should not be affected "
The issue I'm running into is that, I want to parse all of the line except any text contained within quotation marks (i.e. a string). Basically if there are extra spaces or whatever inside a string, I want to assume that it was intended and move on without changing the string at all, but if there are extra spaces in the line text outside of the quotation marks, I want to parse and adjust that accordingly.
So far I have the following regex which does all of the parsing I mentioned above, the only issue is it will affect the contents of strings just like any other part of the line:
var rx = new Regex(@"\A\s+|(?<=\s)\s+|(?<=.)\s+(?=\()|(?<=\()\s+(?=.)|(?<=.)\s+(?=\))|\s+\z");
.
.
.
lineOfText = rx.Replace(lineOfText, String.Empty);
Anyone have any idea how I can approach this, or know of a past question answering this that I couldn't find? Thank you!
Since you are reading the file line by line, you can use the following fix:
("[^"]*(?:""[^"]*)*")|^\s+|(?<=\s)\s+|(?<=\w)\s+(?=\()|(?<=\()\s+(?=\w)|(?<=\w)\s+(?=\))|\s+$
Replace the matched text with $1 to restore the captured string literals that were captured with ("[^"]*(?:""[^"]*)*").
See demo
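As a rough sketch of how that pattern and the $1 replacement might be wired up in C#: the class and method names (LineCleaner, Clean) are assumptions for illustration, and the pattern is the one from this answer written as a verbatim string, so every double quote in it is doubled.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineCleaner
{
    // Group 1 captures a whole string literal (with "" as an escaped quote);
    // the remaining alternatives match whitespace that should be dropped.
    private static readonly Regex Rx = new Regex(
        @"(""[^""]*(?:""""[^""]*)*"")|^\s+|(?<=\s)\s+|(?<=\w)\s+(?=\()|(?<=\()\s+(?=\w)|(?<=\w)\s+(?=\))|\s+$");

    // Hypothetical helper: cleans one line of text at a time.
    public static string Clean(string line)
    {
        // When a string literal matched, $1 puts it back untouched;
        // for every other match group 1 is empty, so the whitespace is removed.
        return Rx.Replace(line, "$1");
    }
}
```

Because the string-literal alternative comes first, the regex engine consumes a quoted string in one match before any of the whitespace alternatives get a chance to look inside it.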
Been scratching my head all day about this one!
Ok, so I have a string which contains the following:
?\"width=\"1\"height=\"1\"border=\"0\"style=\"display:none;\">');
I want to convert that string to the following:
?\"width=1height=1border=0style=\"display:none;\">');
I could theoretically just do a String.Replace on "\"1\"" etc. But this isn't really a viable option as the string could theoretically have any number within the expression.
I also thought about removing the string "\"", however there are other occurrences of this which I don't want to be replaced.
I have been attempting to use the Regex.Replace method as I believe this exists to solve problems along my lines. Here's what I've got:
chunkContents = Regex.Replace(chunkContents, "\".\"", ".");
Now that really messes things up (it replaces the correct elements, but with a full stop), but I think you can see what I am attempting to do with it. I am also worried that this will only work for single digits (\"1\" rather than \"11\"). That led me to think about using the "*" or "+" quantifier rather than just ".", but I foresaw the problem of this picking up all of the text in between the desired characters (which are dotted all over the place), whereas I obviously only want to replace the ones with numeric characters in between them.
Hope I've explained that clearly enough, will be happy to provide any extra info if needed :)
Try this
var str = "?\"width=\"1\"height=\"1234\"border=\"0\"style=\"display:none;\">');";
str = Regex.Replace(str , "\"(\\d+)\"", "$1");
(\\d+) is a capturing group that looks for one or more digits and $1 references what the group captured.
This works
String input = @"?\""width=\""1\""height=\""1\""border=\""0\""style=\""display:none;\"">');";
// replace the entire match of the regex with only what's captured (the number)
String result = Regex.Replace(input, @"\\""(\d+)\\""", match => match.Result("$1"));
// control string for expected result
String shouldBe = @"?\""width=1height=1border=0style=\""display:none;\"">');";
// prints true
Console.WriteLine(result.Equals(shouldBe).ToString());
I'm writing a little class to read a list of key value pairs from a file and write to a Dictionary<string, string>. This file will have this format:
key1:value1
key2:value2
key3:value3
...
This should be pretty easy to do, but since a user is going to edit this file manually, how should I deal with whitespace, tabs, extra line breaks and stuff like that? I can probably use Replace to remove spaces and tabs, but are there any other "invisible" characters I'm missing?
Or maybe I can remove all characters that are not alphanumeric, ":" or line breaks (since line breaks are what separate one pair from another), and then remove all extra line breaks. If I go that way, I don't know how to remove "all-except-some" characters.
Of course I can also check for errors like "key1:value1:somethingelse". But stuff like that doesn't really matter much because it's obviously the user's fault and I would just show a "Invalid format" message. I just want to deal with the basic stuff and then put all that in a try/catch block just in case anything else goes wrong.
Note: I do NOT need any whitespaces at all, even inside a key or a value.
I did this one recently when I finally got pissed off at too much undocumented garbage producing bad XML coming through in a feed. It effectively strips anything that doesn't fall between a space and the ~ in the ASCII table (the range as written also lets DEL, 0x7F, through):
static public string StripControlChars(this string s)
{
    return Regex.Replace(s, @"[^\x20-\x7F]", "");
}
Combined with the other RegEx examples already posted it should get you where you want to go.
If you use Regex (Regular Expressions) you can filter out all of that with one function.
string newVariable = Regex.Replace(variable, @"\s", "");
That will remove all whitespace characters, including spaces, tabs, \n, and \r.
One of the "white" spaces that regularly bites us is the non-breaking space. Also, our system must be compatible with MS-Dynamics, which is much more restrictive. First, I created a function that maps the 8-bit characters to their approximate 7-bit counterparts, then I removed anything that was not in the x20 to x7F range, further limited by the Dynamics interface.
Regex.Replace(s, @"[^\x20-\x7F]", "")
should do that job.
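The mapping function itself isn't shown in this answer, so the following is only one possible sketch of the same two-step idea, not the code described above. It assumes Unicode decomposition is an acceptable stand-in for the mapping step: accented letters are split into a base letter plus a combining mark and the mark is dropped, the non-breaking space is turned into an ordinary space, and whatever is still outside printable ASCII is stripped. The names AsciiSanitizer and ToPlainAscii are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class AsciiSanitizer
{
    // Hypothetical helper: approximate text with printable ASCII only.
    public static string ToPlainAscii(string s)
    {
        // Keep the non-breaking space as a plain space instead of deleting it.
        s = s.Replace('\u00A0', ' ');

        // Canonical decomposition: e.g. U+00E9 becomes 'e' followed by U+0301.
        string decomposed = s.Normalize(NormalizationForm.FormD);

        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Drop the combining marks left over from the decomposition.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }

        // Finally strip anything still outside space (0x20) through ~ (0x7E).
        return Regex.Replace(sb.ToString(), @"[^\x20-\x7E]", "");
    }
}
```

Characters with no decomposition (a soft hyphen, for instance) are simply removed by the last step, so a real mapping table may still be needed for anything that must be preserved in some form.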
The requirements are too fuzzy. Consider:
"When is a space a value? key?"
"When is a delimiter a value? key?"
"When is a tab a value? key?"
"Where does a value end when a delimiter is used in the context of a value? key?"
These problems will result in code filled with one-offs and a poor user experience. This is why we have language rules/grammar.
Define a simple grammar and take out most of the guesswork.
"{key}":"{value}",
Here you have a key/value pair contained within quotes and separated via a delimiter (,). All extraneous characters can be ignored. You could use XML, but this may scare off less techy users.
Note, the quotes are arbitrary. Feel free to replace with any set container that will not need much escaping (just beware the complexity).
Personally, I would wrap this up in a simple UI and serialize the data out as XML. There are times not to do this, but you have given me no reason not to.
var split = textLine.Split(':').Select(s => s.Trim()).ToArray();
The Trim() function will remove all the irrelevant whitespace. Note that this retains whitespace inside of a key or value, which you may want to consider separately.
You can use string.Trim() to remove white-space characters:
var results = lines
    .Select(line => {
        var pair = line.Split(new[] {':'}, 2);
        return new {
            Key = pair[0].Trim(),
            Value = pair[1].Trim(),
        };
    }).ToList();
However, if you want to remove all whitespace, you can use regular expressions:
var whiteSpaceRegex = new Regex(@"\s+", RegexOptions.Compiled);
var results = lines
    .Select(line => {
        var pair = line.Split(new[] {':'}, 2);
        return new {
            Key = whiteSpaceRegex.Replace(pair[0], string.Empty),
            Value = whiteSpaceRegex.Replace(pair[1], string.Empty),
        };
    }).ToList();
If it doesn't have to be fast, you could use LINQ:
string clean = new String(tainted.Where(c => 0 <= "ABCDabcd1234:\r\n".IndexOf(c)).ToArray());
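Pulling the suggestions in these answers together, here is a minimal sketch of a reader that fills a Dictionary<string, string>, assuming (as the question states) that no whitespace is wanted anywhere and that a line with anything other than exactly one ":" is an error. The names KeyValueParser and Parse are made up for illustration, and throwing FormatException is just one way to surface the "Invalid format" case.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeyValueParser
{
    private static readonly Regex Whitespace = new Regex(@"\s+");

    // Hypothetical helper: turns "key:value" lines into a dictionary.
    public static Dictionary<string, string> Parse(IEnumerable<string> lines)
    {
        var result = new Dictionary<string, string>();
        foreach (string rawLine in lines)
        {
            // Remove every whitespace character, including inside keys and values.
            string line = Whitespace.Replace(rawLine, string.Empty);

            // Blank lines (extra line breaks) are simply skipped.
            if (line.Length == 0)
            {
                continue;
            }

            string[] parts = line.Split(':');
            if (parts.Length != 2 || parts[0].Length == 0)
            {
                throw new FormatException("Invalid format: " + rawLine);
            }

            // A repeated key overwrites the earlier value.
            result[parts[0]] = parts[1];
        }
        return result;
    }
}
```

It could be fed with something like KeyValueParser.Parse(File.ReadLines(path)), where path is whatever file the user edits; zero-width characters that \s does not match would still need one of the stripping approaches above.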
I just ran into this problem today and wonder if anyone has an idea why this test may fail (depending on culture). The aim is to check whether the test text contains two spaces next to each other, which it does according to string.IndexOf (even though I tell the string to replace all occurrences of two spaces next to each other). After some testing it seems \xAD is somehow causing this issue.
using NUnit.Framework;

public class ReplaceIndexOfSymmetryTest
{
    [Test]
    public void IndexOfShouldNotFindReplacedString()
    {
        string testText = "\x61\x20\xAD\x20\x62";
        const string TWO_SPACES = "  ";
        const string ONE_SPACE = " ";
        string result = testText.Replace(TWO_SPACES, ONE_SPACE);
        Assert.IsTrue(result.IndexOf(TWO_SPACES) < 0);
    }
}
Yes, I've come across the same thing before (although with different characters). Basically IndexOf will take various aspects of "special" Unicode characters into account when finding matches, whereas Replace just treats the strings as a sequence of code points.
From the IndexOf docs:
This method performs a word (case-sensitive and culture-sensitive) search using the current culture. The search begins at the first character position of this instance and continues until the last character position.
... and from Replace:
This method performs an ordinal (case-sensitive and culture-insensitive) search to find oldValue.
You could use the overload of IndexOf which takes a StringComparison, and force it to perform an ordinal comparison though.
Like Jon said, use StringComparison.Ordinal to get it right.
Assert.IsTrue(result.IndexOf(TWO_SPACES, StringComparison.Ordinal) < 0);