How to handle Japanese names in url?

How to handle Japanese names in url? - c#

The method described here URL Slugify algorithm in C#? returns "" for the input ブルノ.
Is it ok to include Japanese characters in URLs or does it hurt SEO? Will Google/Bing display as ブルノ or %E3%83%96%E3%83%AB%E3%83%8E?
What should I do for the user アウロン?
/user/1/auron (different field to set the name in url)
/user/1/アウロン (set his displayname in url)
ja.wikipedia.org does use Japanese characters in the URL, so can I just assume it is safe? Or does it need something else?

I suggest you do not use the second choice 2./user/1/アウロン.
The reason is, if I do not have the japanses language installed on my machine (I use Windows 7) and you send me a link like www.abc.com/user/1/アウロン it will display a link to www.abc.com/user/1/ instead of www.abc.com/user/1/アウロン.
I noticed this behavior when I sent a link that contains Thai characters to my coworkers on Skype. It linked to incorrect link like what I said in the previous paragraph.

Related

Replacing URL encoded data with their symbols

Hi I am trying to format my Url in order for it to look more user friendly.So far I managed to replace spaces with "-" but it seems that there are special characters like # and : that display as encoded data.This is what I mean:
http://localhost:51208/Home/Details/C%23-in-Depth%2c-Second-Edition/BookId-3
The "#" symbol is displayed as %23 and the "," is displayed as %2c.I would like to be able to replace this encoding with their original symbols.
Does such a way exist?

Oh no, you totally don't want to replace it with #. This symbol has a special meaning in an url. It represents the fragment identifier and its value is never sent to the server. This basically means that if there's a # symbol in your url, everything that follows it gets truncated and never sent to the server. You may take a look at the following post to see what StackOverflow uses to format the slug in the question title. You could run your string through this replace function in order to make sure that no dangerous characters are left.
I would also recommend you reading the following blog post from Scott Hanselman where he covers the various scenarios you might encounter with IIS if you attempt to send special characters in the path portion of your url. I am quoting his conclusion here:
After ALL this effort to get crazy stuff in the Request Path, it's
worth mentioning that simply keeping the values as a part of the Query
String (remember WAY back at the beginning of this post?) is easier,
cleaner, more flexible, and more secure.

Just replace "#" with "sharp" and ":" with "-", you cannot just put those special characters in the url

url encoding c# mismatch encoding

I have a large url that I am encoding using System.Web.HttpUtility.UrlEncode. When I encode it its not encoding it like the working example I have. I am not sure what the problem is, maybe different character type or something, I put an example of what suppose to be created and what actually being created. thanks for any help, i am lost on this one.
Working exmaple (look how this one has Did%252Citag%252 and the other doesnt)
22%7Chttp%3A%2F%2Fv17.nonxt1.googlevideo.com%2Fvideoplayback%3Fid%3D0b608733ae5257c3%26itag%3D22%26source%3Dpicasa%26ip%3D0.0.0.0%26ipbits%3D0%26expire%3D1333533157%26sparams%3Did%252Citag%252Csource%252Cip%252Cipbits%252Cexpire%26signature%3D8AD67D74F34FBAFBBA87616C0AED4A336DF0982A.129E2B5E648F8A2F35A34F312AC5C3C957A1C40A%26key%3Dlh1%2C35%7Chttp%3A%2F%2Fv18.nonxt3.googlevideo.com%2Fvideoplayback%3Fid%3D0b608733ae5257c3%26itag%3D35%26source%3Dpicasa%26ip%3D0.0.0.0%26ipbits%3D0%26expire%3D1333533157%26sparams%3Did%252Citag%252Csource%252Cip%252Cipbits%252Cexpire%26signature%3D7A58A11994C710872E945D0EAA6E43B6BFB8A648.B9C1D9FB377E1A49EBF3DC6C166C0B6E3E94EC24%26key%3Dlh1%2C34%7Chttp%3A%2F%2Fv6.nonxt1.googlevideo.com%2Fvideoplayback%3Fid%3D0b608733ae5257c3%26itag%3D34%26source%3Dpicasa%26ip%3D0.0.0.0%26ipbits%3D0%26expire%3D1333533157%26sparams%3Did%252Citag%252Csource%252Cip%252Cipbits%252Cexpire%26signature%3D260B10850A3448C849B8B8F1F2AF5E31244E71BC.6D7420FD66B85D40982BFB2C847EDB46021C63AE%26key%3Dlh1%2C5%7Chttp%3A%2F%2Fv23.nonxt7.googlevideo.com%2Fvideoplayback%3Fid%3D0b608733ae5257c3%26itag%3D5%26source%3Dpicasa%26ip%3D0.0.0.0%26ipbits%3D0%26expire%3D1333533157%26sparams%3Did%252Citag%252Csource%252Cip%252Cipbits%252Cexpire%26signature%3D9894DCDA7D2634EE0006CE0F6E0E29ABF7A8F253.18765D7CD7BDE80ED1A47DC8EC559C3E05C92F56%26key%3Dlh1
Here is an example of the one I am creating (see this one encodes as did%2citag%2)
5%7chttp%3a%2f%2fv23.nonxt7.googlevideo.com%2fvideoplayback%3fid%3d0b608733ae5257c3%26itag%3d5%26source%3dpicasa%26ip%3d0.0.0.0%26ipbits%3d0%26expire%3d1333562840%26sparams%3did%2citag%2csource%2cip%2cipbits%2cexpire%26signature%3dC0E2993011931D9F5FCAFAF54E821415F6042DDD.477CD23B021563A6DE30E858E35C21046E0B0BA6%26key%3dlh1%2c18%7chttp%3a%2f%2fv11.nonxt4.googlevideo.com%2fvideoplayback%3fid%3d0b608733ae5257c3%26itag%3d18%26source%3dpicasa%26ip%3d0.0.0.0%26ipbits%3d0%26expire%3d1333562840%26sparams%3did%2citag%2csource%2cip%2cipbits%2cexpire%26signature%3d696501A8ACBA0E1246173B040E0FB81DA8EBCDC7.944BA6C08C630EFFC2456D66BAD12376D7E377B2%26key%3dlh1%2c34%7chttp%3a%2f%2fv6.nonxt1.googlevideo.com%2fvideoplayback%3fid%3d0b608733ae5257c3%26itag%3d34%26source%3dpicasa%26ip%3d0.0.0.0%26ipbits%3d0%26expire%3d1333562840%26sparams%3did%2citag%2csource%2cip%2cipbits%2cexpire%26signature%3dDDD3D9081F7F2FF462D17CFAE6CAB72AEB86DEA9.3275E0EE8921EF728132035FC94BEF5926A0B7C1%26key%3dlh1%2c35%7chttp%3a%2f%2fv18.nonxt3.googlevideo.com%2fvideoplayback%3fid%3d0b608733ae5257c3%26itag%3d35%26source%3dpicasa%26ip%3d0.0.0.0%26ipbits%3d0%26expire%3d1333562840%26sparams%3did%2citag%2csource%2cip%2cipbits%2cexpire%26signature%3d7826E7470450F9F473BC7A845967EF3AC655CFB.3850F952F5D68151D325CD754C581CD66B0BC4D7%26key%3dlh1%2c22%7chttp%3a%2f%2fv17.nonxt1.googlevideo.com%2fvideoplayback%3fid%3d0b608733ae5257c3%26itag%3d22%26source%3dpicasa%26ip%3d0.0.0.0%26ipbits%3d0%26expire%3d1333562840%26sparams%3did%2citag%2csource%2cip%2cipbits%2cexpire%26signature%3d32FAAE6AE74B22BFB3DBD4300CEEDBC1A12A9ED4.8014678ABB1AEE93FB4B1C36E2C74C89102DC112%26key%3dlh1

Looks like in the first example the URL is double encoded. Meaning if you look at decoded sparams parameter it is represented as
sparams=id%2Citag%2Csource%2Cip%2Cipbits%2Cexpire
In your second example
sparams=id,itag,source,ip,ipbits,expire
So, what is happening in the first example is that, they are doing a UrlEncode on the value first. Using this value Construct the URL and then do UrlEncode on the constructed URL.
UPDATE : This is a general practice to be followed if the value of your querystring contains values which needs to be UrlEncoded (eg. , & space ? etc)

According to w3c standards, your example is fine. There is no %252 symbol.

I'm not sure exactly what you are expecting, but when you fire these strings into a URL Decoder, this is what you get:
String 1
22|http://v17.nonxt1.googlevideo.com/videoplayback?id=0b608733ae5257c3&itag=22&source=picasa&ip=0.0.0.0&ipbits=0&expire=1333533157&sparams=id%2Citag%2Csource%2Cip%2Cipbits%2Cexpire&signature=8AD67D74F34FBAFBBA87616C0AED4A336DF0982A.129E2B5E648F8A2F35A34F312AC5C3C957A1C40A&key=lh1,35|http://v18.nonxt3.googlevideo.com/videoplayback?id=0b608733ae5257c3&itag=35&source=picasa&ip=0.0.0.0&ipbits=0&expire=1333533157&sparams=id%2Citag%2Csource%2Cip%2Cipbits%2Cexpire&signature=7A58A11994C710872E945D0EAA6E43B6BFB8A648.B9C1D9FB377E1A49EBF3DC6C166C0B6E3E94EC24&key=lh1,34|http://v6.nonxt1.googlevideo.com/videoplayback?id=0b608733ae5257c3&itag=34&source=picasa&ip=0.0.0.0&ipbits=0&expire=1333533157&sparams=id%2Citag%2Csource%2Cip%2Cipbits%2Cexpire&signature=260B10850A3448C849B8B8F1F2AF5E31244E71BC.6D7420FD66B85D40982BFB2C847EDB46021C63AE&key=lh1,5|http://v23.nonxt7.googlevideo.com/videoplayback?id=0b608733ae5257c3&itag=5&source=picasa&ip=0.0.0.0&ipbits=0&expire=1333533157&sparams=id%2Citag%2Csource%2Cip%2Cipbits%2Cexpire&signature=9894DCDA7D2634EE0006CE0F6E0E29ABF7A8F253.18765D7CD7BDE80ED1A47DC8EC559C3E05C92F56&key=lh1
String 2
5|http://v23.nonxt7.googlevideo.com/videoplayback?id=0b608733ae5257c3&itag=5&source=picasa&ip=0.0.0.0&ipbits=0&expire=1333562840&sparams=id,itag,source,ip,ipbits,expire&signature=C0E2993011931D9F5FCAFAF54E821415F6042DDD.477CD23B021563A6DE30E858E35C21046E0B0BA6&key=lh1,18|http://v11.nonxt4.googlevideo.com/videoplayback?id=0b608733ae5257c3&itag=18&source=picasa&ip=0.0.0.0&ipbits=0&expire=1333562840&sparams=id,itag,source,ip,ipbits,expire&signature=696501A8ACBA0E1246173B040E0FB81DA8EBCDC7.944BA6C08C630EFFC2456D66BAD12376D7E377B2&key=lh1,34|http://v6.nonxt1.googlevideo.com/videoplayback?id=0b608733ae5257c3&itag=34&source=picasa&ip=0.0.0.0&ipbits=0&expire=1333562840&sparams=id,itag,source,ip,ipbits,expire&signature=DDD3D9081F7F2FF462D17CFAE6CAB72AEB86DEA9.3275E0EE8921EF728132035FC94BEF5926A0B7C1&key=lh1,35|http://v18.nonxt3.googlevideo.com/videoplayback?id=0b608733ae5257c3&itag=35&source=picasa&ip=0.0.0.0&ipbits=0&expire=1333562840&sparams=id,itag,source,ip,ipbits,expire&signature=7826E7470450F9F473BC7A845967EF3AC655CFB.3850F952F5D68151D325CD754C581CD66B0BC4D7&key=lh1,22|http://v17.nonxt1.googlevideo.com/videoplayback?id=0b608733ae5257c3&itag=22&source=picasa&ip=0.0.0.0&ipbits=0&expire=1333562840&sparams=id,itag,source,ip,ipbits,expire&signature=32FAAE6AE74B22BFB3DBD4300CEEDBC1A12A9ED4.8014678ABB1AEE93FB4B1C36E2C74C89102DC112&key=lh1
You URLS are quite different, and they also have a leading chars that I'm not sure you are wanting.

Character / Language

I just developed a simple asp.net mvc application project for English only. I want to block user's any input for a language other than English. Is it possible to know whether user inputs other languages when they write something on textbox or editor in order to give a popup message?

You could limit the input box to latin characters, but there's no automatic way to see if the user entered something in say English, Finnish or Norwegian. They all mostly use a-z. Any character outside of a-z could give you an indication, but certain accents needs to be allowed in English as well, so it's not 100%.
Google Translate exposes a javascript API to detect the language of text.

Use the following code:
<p>Note that this community uses the English language exclusively, so please be
considerate and write your posts in English. Thank you!</p>

there are two tests you can do. one is to find out what the cultureinfo is set on the users machine:
http://msdn.microsoft.com/en-us/library/system.threading.thread.currentuiculture.aspx
this will give you their current culture setting, which is a start. of course, you can have your setting as 'english' but still typing in russian, and most of the letters will be the same..
so the next step is to discover the language using this: http://www.google.com/uds/samples/language/detect.html
it's not the greatest, according to online discussions, but its a place to start. I'm sure there are better natural language identifiers out there, though.

Checking for Latin 26
If you wanted to ensure that any non-English letters were submitted, you could simply validate that they fall outside the A-Z, a-z, 0-9 and normal punctuation ranges. It sounds like you want the regular non-Latin characters to be detected and rejected.
Detecting the user's OS settings, keyboard settings isn't the best way, as the user could have multiple keyboards attached, and have use of copy/paste.
UI Validation
At the user interface level, you could create a jQuery method that would check the value of a textbox for a value other than your acceptable range. Perhaps that's A-Z, a-z and numeric. You could do this on event onBlur. Remember that you might want to allow ', .
$('#customerName').blur(function() {
var isAlphaNumeric;
//implementation of checking a-z, A-Z, 0-9, etc.
alert(isAlphaNumeric);
});
Controller Validation
If you wanted to ALSO implement this at the controller level, you could run a regex on the incoming values.
public ActionMethod CreateCustomer(string custName)
{
if (IsAcceptableRange(custName))
{
//continue
}
}
public bool IsAcceptableRange(string input)
{
//whitelist all the valid inputs here. be sure to include
//space, period, apostrophe, hypen, etc
Regex alphaNumericPattern=new Regex("[^a-zA-Z0-9]");
return !alphaNumericPattern.IsMatch(input);
}

Google Translate was quoted in two answers, but I want to add that Microsoft Word API may also be used to detect language, just like Word does for check spelling.
It is for sure not the best solution, since language detection by Microsoft Office doesn't work very well (IMHO), but may be an alternative if doing web requests to Google or other remote service on every posted message is not a solution.
Also, check spelling through Microsoft Word API can be useful too. If a message has a huge number of misspelled words when checking in English, it's probably because the message is written in another language (or the author of the message writes too badly, too).
Finally, I completely agree with Matti Virkkunen. The best, and maybe the only way to ensure that messages will be written in English is to ask the users to write in English. Otherwise, it's just as bad as implementing obscenity filters.

Remove anchor from URL in C#

I'm trying to pull in an src value from an XML document, and in one that I'm testing it with, the src is:
<content src="content/Orwell - 1984 - 0451524934_split_2.html#calibre_chapter_2"/>
That creates a problem when trying to open the file. I'm not sure what that #(stuff) suffix is called, so I had no luck searching for an answer. I'd just like a simple way to remove it if possible. I suppose I could write a function to search for a # and remove anything after, but that would break if the filename contained a # symbol (or can a file even have that symbol?)
Thanks!

If you had the src in a string you could use
srcstring.Substring(0,srcstring.LastIndexOf("#"));
Which would return the src without the #. If the values you are retreiving are all web urls then this should work, the # is a bookmark in a url that takes you to a specific part of the page.

You should be OK assuming that URLs won't contain a "#"
The character "#" is unsafe and should
always be encoded because it is used in World Wide Web and in other
systems to delimit a URL from a fragment/anchor identifier that might
follow it.
Source (search for "#" or "unsafe").
Therefore just use String.Split() with the "#" as the split character. This should give you 2 parts. In the highly unlikely event it gives more, just discard the last one and rejoin the remainder.

From Wikipedia:
# is used in a URL of a webpage or other resource to introduce a "fragment identifier" – an id which defines a position within that resource. For example, in the URL http://en.wikipedia.org/wiki/Number_sign#Other_uses the portion after the # (Other_uses) is the fragment identifier, in this case indicating that the display should be moved to show the tag marked by ... in the HTML

It's not safe to remove de anchor of the url. What I mean is that ajax like sites make use of the anchor to keep track of the context. For example gmail. If you go to http://www.gmail.com/#inbox, you go directly to your inbox, but if you go to http://www.gmail.com/#all, you'll go to all your mail.
The server can give a different response based on the anchor, even if the response is a file.

How to use strange characters in a query string

I am using silverlight / ASP .NET and C#. What if I want to do this from silverlight for instance,
// I have left out the quotes to show you literally what the characters
// are that I want to use
string password = vtakyoj#"5
string encodedPassword = HttpUtility.UrlEncode(encryptedPassword, Encoding.UTF8);
// encoded password now = vtakyoj%23%225
URI uri = new URI("http://www.url.com/page.aspx#password=vtakyoj%23%225");
HttpPage.Window.Navigate(uri);
If I debug and look at the value of uri it shows up as this (we are still inside the silverlight app),
http://www.url.com?password=vtakyoj%23"5
So the %22 has become a quote for some reason.
If I then debug inside the page.aspx code (which of course is ASP .NET) the value of Request["password"] is actually this,
vtakyoj#"5
Which is the original value. How does that work? I would have thought that I would have to go,
HttpUtility.UrlDecode(Request["password"], Encoding.UTF8)
To get the original value.
Hope this makes sense?
Thanks.

First lets start with the UTF8 business. Esentially in this case there isn't any. When a string contains characters with in the standard ASCII character range (as your password does) a UTF8 encoding of that string is identical to a single byte ASCII string.
You start with this:-
vtakyoj#"5
The HttpUtility.UrlEncode somewhat aggressively encodes it to:-
vtakyoj%23%225
Its encoded the # and " however only # has special meaning in a URL. Hence when you view string value of the Uri object in Silverlight you see:-
vtakyoj%23"5
Edit (answering supplementary questions)
How does it know to decode it?
All data in a url must be properly encoded thats part of its being valid Url. Hence the webserver can rightly assume that all data in the query string has been appropriately encoded.
What if I had a real string which had %23 in it?
The correct encoding for "%23" would be "%3723" where %37 is %
Is that a documented feature of Request["Password"] that it decodes it?
Well I dunno, you'd have check the documentation I guess. BTW use Request.QueryString["Password"] the presence of this same indexer directly on Request was for the convenience of porting classic ASP to .NET. It doesn't make any real difference but its better for clarity since its easier to make the distinction between QueryString values and Form values.
if I don't use UFT8 the characters are being filtered out.
Aare you sure that non-ASCII characters may be present in the password? Can you provide an example you current example does not need encoding with UTF-8?

If Request["password"] is to work, you need "http://url.com?password=" + HttpUtility.UrlEncode("abc%$^##"). I.e. you need ? to separate the hostname.
Also the # syntax is username:password#hostname, but it has been disabled in IE7 and above IIRC.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.