Remove all problematic characters in an intelligent way in C#

Remove all problematic characters in an intelligent way in C# - c#

Is there any .Net library to remove all problematic characters of a string and only leave alphanumeric, hyphen and underscore (or similar subset) in an intelligent way? This is for using in URLs, file names, etc.
I'm looking for something similar to stringex which can do the following:
A simple prelude
"simple English".to_url =>
"simple-english"
"it's nothing at all".to_url =>
"its-nothing-at-all"
"rock & roll".to_url =>
"rock-and-roll"
Let's show off
"$12 worth of Ruby power".to_url =>
"12-dollars-worth-of-ruby-power"
"10% off if you act now".to_url =>
"10-percent-off-if-you-act-now"
You don't even wanna trust Iconv for this next part
"kick it en Français".to_url =>
"kick-it-en-francais"
"rock it Español style".to_url =>
"rock-it-espanol-style"
"tell your readers 你好".to_url =>
"tell-your-readers-ni-hao"

You can try this
string str = phrase.ToLower(); //optional
str = str.Trim();
str = Regex.Replace(str, #"[^a-z0-9\s_]", ""); // invalid chars
str = Regex.Replace(str, #"\s+", " ").Trim(); // convert multiple spaces into one space
str = str.Substring(0, str.Length <= 400 ? str.Length : 400).Trim(); // cut and trim it
str = Regex.Replace(str, #"\s", "-");

Perhaps this question here can help you on your way. It gives you code on how Stackoverflow generates its url's (more specifically, how question names are turned into nice urls.
Link to Question here, where Jeff Atwood shows their code

From your examples, the closest thing I've found (although I don't think it does everything that you're after) is:
My Favorite String Extension Methods in C#
and also:
ÜberUtils - Part 3 : Strings
Since neither of these solutions will give you exactly what you're after (going from the examples in your question) and assuming that the goal here is to make your string "safe", I'd second Hogan's advice and go with Microsoft's Anti Cross Site Scripting Library, or at least use that as a basis for something that you create yourself, perhaps deriving from the library.
Here's a link to a class that builds a number of string extension methods (like the first two examples) but leverages Microsoft's AntiXSS Library:
Extension Methods for AntiXss
Of course, you can always combine the algorithms (or similar ones) used within the AntiXSS library with the kind of algorithms that are often used in websites to generate "slug" URL's (much like Stack Overflow and many blog platforms do).
Here's an example of a good C# slug generator:
Improved C# Slug Generator

You could use HTTPUtility.UrlEncode, but that would encode everything, and not replace or remove problematic characters. So your spaces would be + and ' would be encoded as well. Not a solution, but maybe a starting point

If the goal is to make the string "safe" I recommend Mirosoft's anti-xss libary

There will be no library capable of what you want since you are stating specific rules that you want applied, e.g. $x => x-dollars, x% => x-percent. You will almost certainly have to write your own method to acheive this. It shouldn't be too difficult. A string extension method and use of one or more Regex's for making the replacements would probably be quite a nice concise way of doing it.
e.g.
public static string ToUrl(this string text)
{
return text.Trim().Regex.Replace(text, ..., ...);
}

Something the Ruby version doesn't make clear (but the original Perl version does) is that the algorithm it's using to transliterate non-Roman characters is deliberately simplistic -- "better than nothing" in both senses. For example, while it does have a limited capability to transliterate Chinese characters, this is entirely context-insensitive -- so if you feed it Japanese text then you get gibberish out.
The advantage of this simplistic nature is that it's pretty trivial to implement. You just have a big table of Unicode characters and their corresponding ASCII "equivalents". You could pull this straight from the Perl (or Ruby) source code if you decide to implement this functionality yourself.

I'm using something like this in my blog.
public class Post
{
public string Subject { get; set; }
public string ResolveSubjectForUrl()
{
return Regex.Replace(Regex.Replace(this.Subject.ToLower(), "[^\\w]", "-"), "[-]{2,}", "-");
}
}

I couldn't find any library that does it, like in Ruby, so I ended writing my own method. This is it in case anyone cares:
/// <summary>
/// Turn a string into something that's URL and Google friendly.
/// </summary>
/// <param name="str"></param>
/// <returns></returns>
public static string ForUrl(this string str) {
return str.ForUrl(true);
}
public static string ForUrl(this string str, bool MakeLowerCase) {
// Go to lowercase.
if (MakeLowerCase) {
str = str.ToLower();
}
// Replace accented characters for the closest ones:
char[] from = "ÂÃÄÀÁÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïðñòóôõöøùúûüýÿ".ToCharArray();
char[] to = "AAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaceeeeiiiidnoooooouuuuyy".ToCharArray();
for (int i = 0; i < from.Length; i++) {
str = str.Replace(from[i], to[i]);
}
// Thorn http://en.wikipedia.org/wiki/%C3%9E
str = str.Replace("Þ", "TH");
str = str.Replace("þ", "th");
// Eszett http://en.wikipedia.org/wiki/%C3%9F
str = str.Replace("ß", "ss");
// AE http://en.wikipedia.org/wiki/%C3%86
str = str.Replace("Æ", "AE");
str = str.Replace("æ", "ae");
// Esperanto http://en.wikipedia.org/wiki/Esperanto_orthography
from = "ĈĜĤĴŜŬĉĝĥĵŝŭ".ToCharArray();
to = "CXGXHXJXSXUXcxgxhxjxsxux".ToCharArray();
for (int i = 0; i < from.Length; i++) {
str = str.Replace(from[i].ToString(), "{0}{1}".Args(to[i*2], to[i*2+1]));
}
// Currencies.
str = new Regex(#"([¢€£\$])([0-9\.,]+)").Replace(str, #"$2 $1");
str = str.Replace("¢", "cents");
str = str.Replace("€", "euros");
str = str.Replace("£", "pounds");
str = str.Replace("$", "dollars");
// Ands
str = str.Replace("&", " and ");
// More aesthetically pleasing contractions
str = str.Replace("'", "");
str = str.Replace("’", "");
// Except alphanumeric, everything else is a dash.
str = new Regex(#"[^A-Za-z0-9-]").Replace(str, "-");
// Remove dashes at the begining or end.
str = str.Trim("-".ToCharArray());
// Compact duplicated dashes.
str = new Regex("-+").Replace(str, "-");
// Let's url-encode just in case.
return str.UrlEncode();
}

Related

How to strip a string from the point a hyphen is found within the string C#

I'm currently trying to strip a string of data that is may contain the hyphen symbol.
E.g. Basic logic:
string stringin = "test - 9894"; OR Data could be == "test";
if (string contains a hyphen "-"){
Strip stringin;
output would be "test" deleting from the hyphen.
}
Console.WriteLine(stringin);
The current C# code i'm trying to get to work is shown below:
string Details = "hsh4a - 8989";
var regexItem = new Regex("^[^-]*-?[^-]*$");
string stringin;
stringin = Details.ToString();
if (regexItem.IsMatch(stringin)) {
stringin = stringin.Substring(0, stringin.IndexOf("-") - 1); //Strip from the ending chars and - once - is hit.
}
Details = stringin;
Console.WriteLine(Details);
But pulls in an Error when the string does not contain any hyphen's.

How about just doing this?
stringin.Split('-')[0].Trim();
You could even specify the maximum number of substrings using overloaded Split constructor.
stringin.Split('-', 1)[0].Trim();

Your regex is asking for "zero or one repetition of -", which means that it matches even if your input does NOT contain a hyphen. Thereafter you do this
stringin.Substring(0, stringin.IndexOf("-") - 1)
Which gives an index out of range exception (There is no hyphen to find).
Make a simple change to your regex and it works with or without - ask for "one or more hyphens":
var regexItem = new Regex("^[^-]*-+[^-]*$");
here -------------------------^

It seems that you want the (sub)string starting from the dash ('-') if original one contains '-' or the original string if doesn't have dash.
If it's your case:
String Details = "hsh4a - 8989";
Details = Details.Substring(Details.IndexOf('-') + 1);

I wouldn't use regex for this case if I were you, it makes the solution much more complex than it can be.
For string I am sure will have no more than a couple of dashes I would use this code, because it is one liner and very simple:
string str= entryString.Split(new [] {'-'}, StringSplitOptions.RemoveEmptyEntries)[0];
If you know that a string might contain high amount of dashes, it is not recommended to use this approach - it will create high amount of different strings, although you are looking just for the first one. So, the solution would look like something like this code:
int firstDashIndex = entryString.IndexOf("-");
string str = firstDashIndex > -1? entryString.Substring(0, firstDashIndex) : entryString;

you don't need a regex for this. A simple IndexOf function will give you the index of the hyphen, then you can clean it up from there.
This is also a great place to start writing unit tests as well. They are very good for stuff like this.
Here's what the code could look like :
string inputString = "ho-something";
string outPutString = inputString;
var hyphenIndex = inputString.IndexOf('-');
if (hyphenIndex > -1)
{
outPutString = inputString.Substring(0, hyphenIndex);
}
return outPutString;

count occurrences in a string similar to run length encoding c#

Say I have a string like
MyString1 = "ABABABABAB";
MyString2 = "ABCDABCDABCD";
MyString3 = "ABCAABCAABCAABCA";
MyString4 = "ABABACAC";
MyString5 = "AAAAABBBBB";
and I need to get the following output
Output1 = "5(AB)";
Output2 = "3(ABCD)";
Output3 = "4(ABCA)";
Output4 = "2(AB)2(AC)";
Output5 = "5(A)5(B)";
I have been looking at RLE but I can't figure out how to do the above.
The code I have been using is
public static string Encode(string input)
{
return Regex.Replace(input, #"(.)\1*", delegate(Match m)
{
return string.Concat(m.Value.Length, "(", m.Groups[1].Value, ")");
});
}
This works for Output5 but can I do the other Outputs with Regex or should I be using something like Linq?
The purpose of the code is to display MyString in a simple manner as I can get MyString being up to a 1000 characters generally with a pattern to it.
I am not too worried about speed.

Using RLE with single characters is easy, there never is an overlap between matches. If the number of characters to repeat is variable, you'd have a problem:
AAABAB
Could be:
3(A)BAB
Or
AA(2)AB
You'll have to define what rules you want to apply. Do you want the absolute best compression? Does speed matter?
I doubt Regex can look forward and select "the best" combination of matches - So to answer your question I would say "no".

RLE is of no help here - it's just an extremely simple compression where you repeat a single code-point a given number of times. This was quite useful for e.g. game graphics and transparent images ("next, there's 50 transparent pixels"), but is not going to help you with variable-length code-points.
Instead, have a look at Huffman encoding. Expanding it to work with variable-length codewords is not exactly cheap, but it's a start - and it saves a lot of space, if you can afford having the table there.
But the first thing you have to ask yourself is, what are you optimizing for? Are you trying to get the shortest possible string on output? Are you going for speed? Do you want as few code-words as possible, or do you need to balance the repetitions and code-word counts in some way? In other words, what are you actually trying to do? :))
To illustrate this on your "expected" return values, Output4 results in a longer string than MyString4. So it's not the shortest possible representation. You're not trying for the least amounts of code-words either, because then Output5 would be 1(AAAAABBBBB). Least amount of repetitions is of course silly (it would always be 1(...)). You're not optimizing for low overhead either, because that's again broken in Output4.
And whichever of those are you trying to do, I'm thinking it's not going to be possible with regular expressions - those only work for regular languages, and encoding like this doesn't seem all that regular to me. The decoding does, of course; but I'm not so sure about the encoding.

Here is a Non-Regex way given the data that you provided. I'm not sure of any edge cases, right now, that would trip this code up. If so, I'll update accordingly.
string myString1 = "ABABABABAB";
string myString2 = "ABCDABCDABCD";
string myString3 = "ABCAABCAABCAABCA";
string myString4 = "ABABACAC";
string myString5 = "AAAAABBBBB";
CountGroupOccurrences(myString1, "AB");
CountGroupOccurrences(myString2, "ABCD");
CountGroupOccurrences(myString3, "ABCA");
CountGroupOccurrences(myString4, "AB", "AC");
CountGroupOccurrences(myString5, "A", "B");
CountGroupOccurrences() looks like the following:
private static void CountGroupOccurrences(string str, params string[] patterns)
{
string result = string.Empty;
while (str.Length > 0)
{
foreach (string pattern in patterns)
{
int count = 0;
int index = str.IndexOf(pattern);
while (index > -1)
{
count++;
str = str.Remove(index, pattern.Length);
index = str.IndexOf(pattern);
}
result += string.Format("{0}({1})", count, pattern);
}
}
Console.WriteLine(result);
}
Results:
5(AB)
3(ABCD)
4(ABCA)
2(AB)2(AC)
5(A)5(B)
UPDATE
This worked with Regex
private static void CountGroupOccurrences(string str, params string[] patterns)
{
string result = string.Empty;
foreach (string pattern in patterns)
{
result += string.Format("{0}({1})", Regex.Matches(str, pattern).Count, pattern);
}
Console.WriteLine(result);
}

Shorthand way to remove last forward slash and trailing characters from string

If I have the following string:
/lorem/ipsum/dolor
and I want this to become:
/lorem/ipsum
What is the short-hand way of removing the last forward slash, and all characters following it?
I know how I can do this by spliting the string into a List<> and removing the last item, and then joining, but is there a shorter way of writing this?
My question is not URL specific.

You can use Substring() and LastIndexOf():
str = str.Substring(0, str.LastIndexOf('/'));
EDIT (suggested comment)
To prevent any issues when the string may not contain a /, you could use something like:
int lastSlash = str.LastIndexOf('/');
str = (lastSlash > -1) ? str.Substring(0, lastSlash) : str;
Storing the position in a temp-variable would prevent the need to call .LastIndexOf('/') twice, but it could be dropped in favor of a one-line solution instead.

If there is '/' at the end of the url, remove it.
If not; just return the original one.
var url = this.Request.RequestUri.ToString();
url = url.EndsWith("/") ? url.Substring(0, url.Length - 1) : url;
url += #"/mycontroller";

You can do something like str.Remove(str.LastIndexOf("/")), but there is no built-in method to do what you want.
Edit: you could also use the Uri object to traverse directories, although it does not give exactly what you want:
Uri baseUri = new Uri("http://domain.com/lorem/ipsum/dolor");
Uri myUri = new Uri(baseUri, ".");
// myUri now contains http://domain.com/lorem/ipsum/

One simple way would be
String s = "domain.com/lorem/ipsum/dolor";
s = s.Substring(0, s.LastIndexOf('/'));
Console.WriteLine(s);
Another maybe
String s = "domain.com/lorem/ipsum/dolor";
s = s.TrimEnd('/');
Console.WriteLine(s);

You can use the regex /[^/]*$ and replace with the empty string:
var fixed = new Regex("/[^/]*$").Replace("domain.com/lorem/ipsum/dolor", "")
But it's probably overkill here. #newfurniturey's answer of Substring with LastIndexOf is probably best.

I like to create a String Extension for stuff like this:
/// <summary>
/// Returns with suffix removed, if present
/// </summary>
public static string TrimIfEndsWith(
this string value,
string suffix)
{
return
value.EndsWith(suffix) ?
value.Substring(0, value.Length - suffix.Length) :
value;
}
You can then use like this:
var myString = "/lorem/ipsum/dolor";
myStringClean = myString.TrimIfEndsWith("/dolor");
You now have a re-usable extension across all of your projects that can be used to remove one trailing character or multiple.

using System.IO;
mystring.TrimEnd(Path.AltDirectorySeparatorChar); // To remove "/"
mystring.TrimEnd(Path.DirectorySeparatorChar); // To remove "\"

while (input.Last() == '/' || input.Last() == '\\')
{
input = input.Substring(0, input.Length - 1);
}

Thank you #Curt for your question.
I slightly improved #newfurniturey's code, and here is my version.
if(str.Contains('/')){
str = str.Substring(0, str.LastIndexOf('/'));
}

I'm way late to the party, but if you're using C# 8.0+, another clean approach would be to use the range operator:
if (urlStr.EndsWith("/")) urlStr = urlStr[..^1];
If you're curious as to how this works, take a look at the spec for ranges in C#:
https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/proposals/csharp-8.0/ranges
tldr; urlStr[..^1] roughly translates to something along the lines of "Give me a substring comprised of the characters contained within the range of index 0 to whatever index is 1 away from the last index.".
In other words, it's similar to...
urlStr.Substring(0, urlStr.Length-1)

.NET String parsing performance improvement - Possible Code Smell

The code below is designed to take a string in and remove any of a set of arbitrary words that are considered non-essential to a search phrase.
I didn't write the code, but need to incorporate it into something else. It works, and that's good, but it just feels wrong to me. However, I can't seem to get my head outside the box that this method has created to think of another approach.
Maybe I'm just making it more complicated than it needs to be, but I feel like this might be cleaner with a different technique, perhaps by using LINQ.
I would welcome any suggestions; including the suggestion that I'm over thinking it and that the existing code is perfectly clear, concise and performant.
So, here's the code:
private string RemoveNonEssentialWords(string phrase)
{
//This array is being created manually for demo purposes. In production code it's passed in from elsewhere.
string[] nonessentials = {"left", "right", "acute", "chronic", "excessive", "extensive",
"upper", "lower", "complete", "partial", "subacute", "severe",
"moderate", "total", "small", "large", "minor", "multiple", "early",
"major", "bilateral", "progressive"};
int index = -1;
for (int i = 0; i < nonessentials.Length; i++)
{
index = phrase.ToLower().IndexOf(nonessentials[i]);
while (index >= 0)
{
phrase = phrase.Remove(index, nonessentials[i].Length);
phrase = phrase.Trim().Replace(" ", " ");
index = phrase.IndexOf(nonessentials[i]);
}
}
return phrase;
}
Thanks in advance for your help.
Cheers,
Steve

This appears to be an algorithm for removing stop words from a search phrase.
Here's one thought: If this is in fact being used for a search, do you need the resulting phrase to be a perfect representation of the original (with all original whitespace intact), but with stop words removed, or can it be "close enough" so that the results are still effectively the same?
One approach would be to tokenize the phrase (using the approach of your choice - could be a regex, I'll use a simple split) and then reassemble it with the stop words removed. Example:
public static string RemoveStopWords(string phrase, IEnumerable<string> stop)
{
var tokens = Tokenize(phrase);
var filteredTokens = tokens.Where(s => !stop.Contains(s));
return string.Join(" ", filteredTokens.ToArray());
}
public static IEnumerable<string> Tokenize(string phrase)
{
return string.Split(phrase, ' ');
// Or use a regex, such as:
// return Regex.Split(phrase, #"\W+");
}
This won't give you exactly the same result, but I'll bet that it's close enough and it will definitely run a lot more efficiently. Actual search engines use an approach similar to this, since everything is indexed and searched at the word level, not the character level.

I guess your code is not doing what you want it to do anyway. "moderated" would be converted to "d" if I'm right. To get a good solution you have to specify your requirements a bit more detailed. I would probably use Replace or regular expressions.

I would use a regular expression (created inside the function) for this task. I think it would be capable of doing all the processing at once without having to make multiple passes through the string or having to create multiple intermediate strings.
private string RemoveNonEssentialWords(string phrase)
{
return Regex.Replace(phrase, // input
#"\b(" + String.Join("|", nonessentials) + #")\b", // pattern
"", // replacement
RegexOptions.IgnoreCase)
.Replace(" ", " ");
}
The \b at the beginning and end of the pattern makes sure that the match is on a boundary between alphanumeric and non-alphanumeric characters. In other words, it will not match just part of the word, like your sample code does.

Yeah, that smells.
I like little state machines for parsing, they can be self-contained inside a method using lists of delegates, looping through the characters in the input and sending each one through the state functions (which I have return the next state function based on the examined character).
For performance I would flush out whole words to a string builder after I've hit a separating character and checked the word against the list (might use a hash set for that)

I would create A Hash table of Removed words parse each word if in the hash remove it only one time through the array and I believe that creating a has table is O(n).

How does this look?
foreach (string nonEssent in nonessentials)
{
phrase.Replace(nonEssent, String.Empty);
}
phrase.Replace(" ", " ");

If you want to go the Regex route, you could do it like this. If you're going for speed it's worth a try and you can compare/contrast with other methods:
Start by creating a Regex from the array input. Something like:
var regexString = "\\b(" + string.Join("|", nonessentials) + ")\\b";
That will result in something like:
\b(left|right|chronic)\b
Then create a Regex object to do the find/replace:
System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(regexString, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
Then you can just do a Replace like so:
string fixedPhrase = regex.Replace(phrase, "");

best way to turn a post title into an URL in c#

I was wondering which is the best way to turn a string (e.g. a post title) into a descriptive URL.
the simplest way that comes to mind is by using a regex, such in:
public static Regex regex = new Regex(
"\\W+",
RegexOptions.IgnoreCase
| RegexOptions.CultureInvariant
| RegexOptions.IgnorePatternWhitespace
| RegexOptions.Compiled
);
string result = regex.Replace(InputText,"_");
which turns
"my first (yet not so bad) cupcake!! :) .//\."
into
my_first_yet_not_so_bad_cupcake_
then I can strip the last "_" and check it against my db and see if it's yet present. in that case I would add a trailing number to make it unique and recheck.
I could use it in, say
http://myblogsite.xom/posts/my_first_yet_not_so_bad_cupcake
but, is this way safe? should i check other things (like the length of the string)
is there any other, better method you prefer?
thanks

Here's what I do. regStripNonAlpha removes all the non-alpha or "-" characters. Trim() removes trailing and leading spaces (so we don't end up with dashes on either side). regSpaceToDash converts spaces (or runs of spaces) into a single dash. This has worked well for me.
static Regex regStripNonAlpha = new Regex(#"[^\w\s\-]+", RegexOptions.Compiled);
static Regex regSpaceToDash = new Regex(#"[\s]+", RegexOptions.Compiled);
public static string MakeUrlCompatible(string title)
{
return regSpaceToDash.Replace(
regStripNonAlpha.Replace(title, string.Empty).Trim(), "-");
}

string result = regex.Replace(InputText,"-");
instead of under score put hypen (-) that would give added advantage for Google search engine.
See below post for more details
http://www.mattcutts.com/blog/dashes-vs-underscores/

Here's a method I wrote not too long ago that takes a string and formats it to a permalink.
private string FormatPermalink(string title)
{
StringBuilder result = new StringBuilder();
title = title.Trim();
bool lastOneChanged = false;
for (int i = 0; i < title.Length; i++)
{
char c = title[i];
if (!char.IsLetterOrDigit(c))
{
c = '_';
if (lastOneChanged)
{
continue;
}
lastOneChanged = true;
}
else
{
lastOneChanged = false;
}
result.Append(c);
}
if (result[result.Length - 1] == '_') //if last one is underscore, remove
{
result = result.Remove(result.Length - 1, 1);
}
return result.ToString();
}
This takes into account special characters as well, so if the title has a special character, it just ignores it and moves on to the next one.

You could look into a URL re-writing HTTPModule. There are many examples on the net.
Once implemented in your web.config you simply specify the regular expression to map to the "real" page using the SEO friendly name
<!-- Rule 1: example... "/admin/somepage" redirects to..."/UI/Forms/Admin/frmPage.aspx" -->
<add key="^/admin/(.*)" value="/UI/Forms/Admin/frm$1.aspx" />

If you want to avoid doing this yourself, an HttpModule like http://urlrewriter.net/
could help. It's pretty good but requires a bit setting up.

Personally, I'd couple your special character removing with a date so your example would look like:
http://myblogsite.xom/posts/2009/04/03/my_first_yet_not_so_bad_cupcake
That way, if you content with the same title, it gets differentiated by date too. I see this often on some blogs I visit where they use "Five Random Things Make A Post" a lot (but not within the same day).

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.