Culture specific characters to nice URL format - c#

I need a way to convert the following string to a URL-friendly format:
"knæ som gør" should become "kna-som-gor"
That is, replace culture-specific characters with characters that can be used in URLs.
I'm using .NET and C#.
Please help me :)
/Andreas

Don't complicate things. :)
Either use a regexp, or simply use String.Replace.

You can find a solution that removes diacritics here: How do I remove diacritics (accents) from a string in .NET?. This solution does not help you with æ or ø, though.
Maybe that removes enough of your special characters that the rest can be translated using simple replacing?
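A minimal sketch of that combination, assuming a small hand-written replacement table for the characters that Unicode decomposition does not touch. The SlugSketch class, the ToSlug method and the table contents are hypothetical; æ is mapped to "a" only because the question asks for "kna-som-gor" ("ae" is the more common choice).

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class SlugSketch
{
    // Hypothetical table for letters that FormD does not decompose.
    private static readonly Dictionary<char, string> Special = new Dictionary<char, string>
    {
        { 'æ', "a" },  // as requested in the question; "ae" is a common alternative
        { 'ø', "o" },
        { 'ß', "ss" }
    };

    public static string ToSlug(string input)
    {
        // Decompose accented letters into base letter + combining marks.
        string decomposed = input.ToLowerInvariant().Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder();
        foreach (char c in decomposed)
        {
            // Drop the combining marks (the diacritics themselves).
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            if (Special.TryGetValue(c, out string replacement))
                sb.Append(replacement);
            else if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
                sb.Append(c);
            else if (sb.Length > 0 && sb[sb.Length - 1] != '-')
                sb.Append('-');  // collapse any run of other characters into one hyphen
        }
        return sb.ToString().Trim('-');
    }
}
```

Calling SlugSketch.ToSlug("knæ som gør") should give "kna-som-gor". Any letter missing from the table is turned into a hyphen, so the table needs extending for other languages.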
If "url-friendly" does not mean pretty, you could also use HttpUtility.UrlEncode, which produces
"kn%c3%a6+som+g%c3%b8r".

Edit: Added possible solution (end of post).
I had a very similar problem, albeit for file names rather than URLs. The main problem seems to be that there is no standard way to ask for the "best ASCII replacement for ø", so even if you can locate all the unwanted characters it is hard to automate which replacement to insert.
I posted quite a bit of code that might be helpful. See this StackOverflow question for details.
Edit: I think the solution to this problem lies with StringInfo, which allows you to iterate through the text elements of a string (a base character together with any surrogates or combining characters that belong to it). This should make it possible to detect and convert something like å, which can be encoded in Unicode either as the single precomposed character LATIN SMALL LETTER A WITH RING ABOVE or as a plain a followed by COMBINING RING ABOVE: filter out the decorator and keep the part that is a normal character.
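A rough sketch of the StringInfo idea, assuming the string is first normalized to FormD so that precomposed letters are split into base character plus combining marks. BaseCharSketch and KeepBaseCharacters are hypothetical names, not something from the linked post.

```csharp
using System.Globalization;
using System.Text;

public static class BaseCharSketch
{
    public static string KeepBaseCharacters(string input)
    {
        // FormD turns a precomposed "å" into "a" followed by a combining ring.
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var result = new StringBuilder();

        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(decomposed);
        while (elements.MoveNext())
        {
            string element = elements.GetTextElement();
            // Keep only the first code point of each text element
            // (two chars when it is a surrogate pair), dropping the decorators.
            int baseLength = char.IsSurrogatePair(element, 0) ? 2 : 1;
            result.Append(element, 0, baseLength);
        }
        return result.ToString();
    }
}
```

Both encodings of å come out as a plain "a". Letters such as ø and æ have no decomposition, so they pass through unchanged and still need a manual replacement.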

Related

Converting "bad" characters to their equivalent without a direct string.Replace and a list

I have done my research and everything I've found either does nothing or is too Leeroy Jenkins and replaces everything else that it shouldn't. It's possible that I'm phrasing everything wrong in my search and so coming up with nothing.
I have to replace all the wrong characters that rich text programs (and older programs) autocorrect for the user, because the user then copies and pastes directly into a web form.
For example, the "funky" apostrophe (’) converted to the regular apostrophe (') and the quotation marks and everything else.
I've tried UTF en/decoding, diacritic removal (not at all what I need), and a direct brute force string.Replace isn't reasonable, really.
Here's some example text that has all the bad stuff:
"They’re taking the hobbits to Isengaurd with bad apostrophe’s instead of good one's. It’s just how they roll."
Note that the only good apostrophe is in one's, and that I already have one rendered result of this (It’s), so I need to convert it back (along with all the other baddies) without a string.Replace and a list of characters to watch for.
What ought I be doing here?
To clarify: I need to convert the bad characters to good equivalents before data is submitted AND I need to catch existing stuff that was rendered after it was saved. So I need to do two things here.

Regular expression for valid filename

I have already gone through some questions on StackOverflow regarding this, but nothing helped much in my case.
I want to restrict the user to providing a filename that contains only alphanumeric characters, -, _, . and space.
I'm not good at regular expressions, and so far I have come up with this: ^[a-zA-Z0-9.-_]$. Can somebody help me?
This is the correct expression:
string regex = @"^[\w\-. ]+$";
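\w is roughly equivalent to [0-9a-zA-Z_]. In .NET it also matches letters and digits from other scripts unless RegexOptions.ECMAScript is set.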
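A small sketch of how this pattern could be applied, assuming you want \w limited to ASCII; RegexOptions.ECMAScript does that in .NET. The FileNamePatternSketch class and IsAllowed method are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class FileNamePatternSketch
{
    // With RegexOptions.ECMAScript, \w matches only [a-zA-Z0-9_].
    private static readonly Regex Allowed =
        new Regex(@"^[\w\-. ]+$", RegexOptions.ECMAScript);

    public static bool IsAllowed(string fileName)
    {
        return Allowed.IsMatch(fileName);
    }
}
```

With this option "my file-1.txt" is accepted while "æøå.txt" and "a/b.txt" are rejected. Without the option the first two would both pass.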
To validate a file name I would suggest using the function provided by .NET rather than a regex:
if (filename.IndexOfAny(System.IO.Path.GetInvalidFileNameChars()) != -1)
{
    // filename contains at least one character that is invalid on this platform
}
While what the OP asks for is close to what the currently accepted answer uses (^[\w\-. ]+$), there might be others seeing this question who have even more specific constraints.
First off, in .NET \w matches Unicode letters and digits by default, so it will allow a wide range of characters from other languages that the OP's constraints exclude.
Secondly, if the file extension is included in the name, this allows all sorts of weird looking, though valid, filenames like file .txt or file...txt.
Thirdly, if you're simply uploading the files to your file system, you might want a blacklist of files and/or extensions like these:
web.config, hosts, .gitignore, httpd.conf, .htaccess
However, that is considerably out of scope for this question; it would require all sorts of info about the setup for good guidance on security issues. I thought I should raise the matter none the less.
So for a solution where the user can input the full file name, I would go with something like this:
^[a-zA-Z0-9](?:[a-zA-Z0-9 ._-]*[a-zA-Z0-9])?\.[a-zA-Z0-9_-]+$
It ensures that only the English alphabet is used, that there are no leading or trailing spaces, and that the name ends in a file extension of at least one character with no whitespace.
I've tested this on Regex101, but for future reference, this was my "test-suite":
## THE BELOW SHOULD MATCH
web.config
httpd.conf
test.txt
1.1
my long file name.txt
## THE BELOW SHOULD NOT MATCH - THOUGH VALID
æøå.txt
hosts
.gitignore
.htaccess
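For anyone who wants to rerun that test-suite from C# rather than Regex101, here is a small sketch. The StrictFileNameSketch class name is made up; the pattern is the one given above, unchanged.

```csharp
using System.Text.RegularExpressions;

public static class StrictFileNameSketch
{
    // Same expression as above: ASCII letters and digits, no leading or
    // trailing space in the name part, and a mandatory extension.
    private static readonly Regex Pattern = new Regex(
        @"^[a-zA-Z0-9](?:[a-zA-Z0-9 ._-]*[a-zA-Z0-9])?\.[a-zA-Z0-9_-]+$");

    public static bool IsMatch(string fileName)
    {
        return Pattern.IsMatch(fileName);
    }
}
```

StrictFileNameSketch.IsMatch should return true for the five names in the first group and false for the four in the second group.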
In case someone else needs to validate filenames (including Windows reserved words and such), here's a full expression:
\A(?!(?:COM[0-9]|CON|LPT[0-9]|NUL|PRN|AUX|com[0-9]|con|lpt[0-9]|nul|prn|aux)|[\s\.])[^\\\/:*"?<>|]{1,254}\z
Extended expression (don't allow filenames starting with 2 dots, don't allow filenames ending in dots or whitespace):
\A(?!(?:COM[0-9]|CON|LPT[0-9]|NUL|PRN|AUX|com[0-9]|con|lpt[0-9]|nul|prn|aux)|\s|[\.]{2,})[^\\\/:*"?<>|]{1,254}(?<![\s\.])\z
Edit:
For the interested, here's a link to Windows file naming conventions:
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx
use this regular expression ^[a-zA-Z0-9._ -]+$
This is a minor change to Engineer's answer.
string regex = @"^[\w\- ]+[\w\-. ]*$";
This will block ".txt" which isn't valid.
Trouble is, it does block "..txt" which is valid
For full character set (Unicode) use
^[\p{L}0-9_\-.~]+$
or perhaps
^[\p{L}\p{N}_\-.~]+$
would be more accurate if we are talking about Unicode.
I added a '~' simply because I have some files using that character.
I've just created this. It prevents consecutive dots and a dot at the beginning or end. Note that it does not allow more than one dot anywhere in the name, though.
^([a-zA-Z0-9_]+)\.(?!\.)([a-zA-Z0-9]{1,5})(?<!\.)$
When used in HTML5 via pattern:
<form action="" method="POST">
<fieldset>
<legend>Export Configuration</legend>
<label for="file-name">File Name</label>
<input type="text" required pattern="^[\w\-. ]+$" id="file-name" name="file_name"/>
</fieldset>
<button type="submit">Export Settings</button>
</form>
This will validate the input against the pattern above. You can remove required to allow an empty value; to turn off the native HTML5 validation altogether, add the novalidate attribute to the form.
I may be saying something stupid here, but it seems to me that these answers aren't correct. Firstly, are we talking Linux or Windows here (or another OS)?
Secondly, in Windows it is (I believe) perfectly legitimate to include a "$" in a filename, not to mention Unicode in general. It certainly seems possible.
I tried to get a definitive source on this... and ended up at the Wikipedia Filename page. In particular, the section "Reserved characters and words" seems relevant: it is, clearly, a list of things which you are NOT allowed to put in.
I'm in the Java world. And I naturally assumed that Apache Commons would have something like validateFilename, maybe in FilenameUtils... but it appears not (if it had done, this would still be potentially useful to C# programmers, as the code is usually pretty easy to understand, and could therefore be translated). I did do an experiment, though, using the method normalize: to my disappointment it allowed perfectly invalid characters (?, etc.) to "pass".
The part of the Wikipedia Filename page referenced above shows that this question depends on the OS you're using... but it should be possible to concoct some simple regex for Linux and Windows at least.
Then I found a Java way (at least):
Path path = java.nio.file.FileSystems.getDefault().getPath("bobb??::mouse.blip");
output:
java.nio.file.InvalidPathException: Illegal char at index 4:
bobb??::mouse.blip
... presumably different FileSystem objects will have different validation rules
Copied from @Engineer for future reference, as the dot was not escaped (as it should be) in the most voted answer.
This is the correct expression:
string regex = @"^[\w\-\. ]+$";

Splitting string on commas when data can contain commas

I have a CSV file (which I didn't design and I can't change now nor will I ever be able to change it) that contains lines like the following:
"Surname, Firstname", yes, no, somestring, whatever, etc
As you can see here, the first , is not a comma on which I'd want to split the string. Notice that this particular comma is enclosed within the quotation marks.
Because of this, a simple string.split(',') obviously won't work, as it would give me an array of length 7 for the above string instead of 6.
Is there a way to get around this? I was thinking of using regex to split the string instead but I'm not competent enough in regex to think of a pattern that would only split on commas that are not enclosed inside quotation marks.
I can think of ugly, hacky ways to do it by reading each string char by char but this would have to be a last resort as I'm sure there's a better way to do it!
You can handle this easily by using the TextFieldParser class. Just set HasFieldsEnclosedInQuotes to true.
I would suggest using a CSV parser library - there are other cases that you wouldn't have thought of (new line as part of a quoted field).
The VisualBasic namespace has a nice library that can help - the TextFieldParser.
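A minimal sketch of the TextFieldParser approach for a single line, assuming a reference to the Microsoft.VisualBasic assembly is available. The CsvLineSketch class and SplitLine method are hypothetical names.

```csharp
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class CsvLineSketch
{
    public static string[] SplitLine(string line)
    {
        // TextFieldParser reads from any TextReader, so a StringReader
        // is enough to parse one line held in memory.
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;  // commas inside quotes stay in the field
            parser.TrimWhiteSpace = true;             // drop the space after each comma
            return parser.ReadFields();
        }
    }
}
```

For the sample line from the question this should return six fields, the first being Surname, Firstname without the surrounding quotes. For a whole file you would construct the parser once on the file path and call ReadFields in a loop until EndOfData is true.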
I know there are a lot of people here who think character-by-character comparisons should never be used and will strongly disagree with me, but I'm not convinced companies like Microsoft are the only ones who should be doing that sort of programming.
After all, Split does character-by-character comparisons, so why is it any less ugly when you call existing code that doesn't do exactly what you want?
At any rate, my approach was to write my own code. And I've posted the code online at http://www.blackbeltcoder.com/Articles/files/reading-and-writing-csv-files-in-c.
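To give an idea of what the character-by-character approach looks like, here is a short sketch. It is not the code from the linked article; the QuoteAwareSplitSketch class and its Split method are hypothetical, and it assumes one record per line (no line breaks inside quoted fields).

```csharp
using System.Collections.Generic;
using System.Text;

public static class QuoteAwareSplitSketch
{
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c);      // commas are ordinary text inside quotes
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"');    // a doubled quote is an escaped quote
                    i++;
                }
                else
                {
                    inQuotes = false;       // closing quote
                }
            }
            else if (c == '"')
            {
                inQuotes = true;            // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString().Trim());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString().Trim());  // last field has no trailing comma
        return fields;
    }
}
```

Note that Trim is applied to every field, so leading or trailing whitespace inside a quoted field is lost as well; drop the Trim calls if that matters.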

Simplifying Regex's - escaping

I want to enable my users to specify the allowed characters in a given string.
So... regexes are great but too tough for my users.
My plan is to enable users to specify a list of allowed characters, for example:
a-z|A-Z|0-9|,
I can transform this into a regex which does the matching as such:
[a-zA-Z0-9,]*
However, I'm a little lost as to how to deal with all the escaping. Imagine if a user specified:
a-z|A-Z|0-9| |,|||\|*|[|]|{|}|(|)
Clearly one option is to deal with every case individually, but before I write such a nasty solution: is there some nifty way to do this?
Thanks
David
Forget regex, here is a much simpler solution:
bool isInputValid = inputString.All(c => allowedChars.Contains(c));
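Spelled out as a complete sketch (the line above needs System.Linq), assuming allowedChars is simply a string holding every permitted character. The AllowedCharsSketch class and IsValid method are hypothetical names, and the HashSet is only there to keep lookups cheap for long inputs.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class AllowedCharsSketch
{
    public static bool IsValid(string input, string allowedChars)
    {
        // A string is an IEnumerable<char>, so it can seed the set directly.
        var allowed = new HashSet<char>(allowedChars);
        return input.All(c => allowed.Contains(c));
    }
}
```

No escaping is needed at all here, because characters such as [, ] and \ are never interpreted as regex syntax.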
You might be right about your customers, but you could provide some introductory regex material and see how they get on - you might be surprised.
If you really need to simplify, you'll probably need to jettison the use of pipe characters too, and provide an alternative such as putting each item on a new line (in a multi-line text box, for instance).
To make it as simple as possible for your users, why don't you ditch the "|" and the concept of character ranges, e.g., "a-z", and get them just to type the complete list of characters they want to allow:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890 *{}()
You get the idea. I think this will be much simpler.
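If a regex is still wanted at the end, one way to sidestep the escaping question is to write every character of that list as a \uXXXX escape, which is safe inside a character class whatever the character is. This is only a sketch; CharClassSketch and BuildValidator are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CharClassSketch
{
    public static Regex BuildValidator(string allowedChars)
    {
        if (string.IsNullOrEmpty(allowedChars))
            throw new ArgumentException("At least one allowed character is required.");

        // Emit each distinct character as \uXXXX so that ], -, ^ and \ lose
        // their special meaning inside the character class.
        string escaped = string.Concat(
            allowedChars.Distinct().Select(c => "\\u" + ((int)c).ToString("x4")));

        // \z anchors at the very end of the input.
        return new Regex("^[" + escaped + "]*\\z");
    }
}
```

BuildValidator("abc]-^").IsMatch(input) is then true only when every character of input appears in the list.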

PHPs htmlspecialcharacters equivalent in .NET?

PHP has a great function called htmlspecialchars(): you pass it a string and it replaces all of HTML's special characters with their safe equivalents. It's almost a one-stop shop for sanitizing input. Very nice, right?
Well is there an equivalent in any of the .NET libraries?
If not, can anyone link to any code samples or libraries that do this well?
Try this.
var encodedHtml = HttpContext.Current.Server.HtmlEncode(...);
System.Web.HttpUtility.HtmlEncode(string)
Don't know if there's an exact replacement, but there is a method HttpUtility.HtmlEncode that replaces special characters with their HTML equivalents. A close cousin is HttpUtility.UrlEncode for rendering URLs. You could also use validator controls like RegularExpressionValidator and RangeValidator, or the System.Text.RegularExpressions.Regex class, to make sure you're getting what you want.
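A quick sketch of what HttpUtility.HtmlEncode does to a string. HtmlEncodeSketch and Encode are hypothetical wrapper names, and the example assumes System.Web.HttpUtility is available to the project; System.Net.WebUtility.HtmlEncode is an alternative that does not depend on System.Web.

```csharp
using System.Web;

public static class HtmlEncodeSketch
{
    public static string Encode(string userInput)
    {
        // Replaces <, >, & and " with their HTML entities.
        return HttpUtility.HtmlEncode(userInput);
    }
}
```

For example, Encode("<b>Tom & \"Jerry\"</b>") gives &lt;b&gt;Tom &amp; &quot;Jerry&quot;&lt;/b&gt;.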
Actually, you might want to try this method:
HttpUtility.HtmlAttributeEncode()
Why? Citing the HtmlAttributeEncode page at MSDN docs:
The HtmlAttributeEncode method converts only quotation marks ("), ampersands (&), and left angle brackets (<) to equivalent character entities. It is considerably faster than the HtmlEncode method.
In addition to the given answers:
When using the Razor view engine (which is the default view engine in ASP.NET MVC), using the '@' character to display values will automatically encode the displayed value. This means that you don't have to encode it yourself.
On the other hand, when you don't want the text to be encoded, you have to specify that explicitly (by using @Html.Raw). Which is, in my opinion, a good thing from a security point of view.
