Information on gotchas for multilingual applications [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I am currently working on a .NET 4.5 application that contains multilingual data.
I am new to this, so I am looking for resources that explain concepts such as encodings for different languages, globalization, localization, etc.
Any tips as to where I should look for such information?

MSDN, as always, is the best resource: http://msdn.microsoft.com/en-us/library/h6270d0z.aspx
Some gotchas from my own experience:
Use Unicode types in your database. For SQL Server, make your text columns nvarchar or ntext instead of varchar or text so that they are stored as Unicode. Otherwise you will lose information in languages such as Chinese.
Make your design flexible. A phrase that is 10 characters in English could easily be 3-4 times as long in German or French, so make your buttons flexible (the sliding doors technique for HTML, for example), and make widths and heights percentages and as responsive as possible.
In your resource files, have plural and singular forms of strings with placeholders for numbers. For example, if you have a phrase stating "within 2 km of this place", you will probably need a resource entry for the unit separately from the whole sentence to handle singular and plural (kilometer, kilometers). Don't assume that you can just add an "s" for pluralization; that won't work in all languages. Some languages, such as Arabic, even have a separate form for exactly two objects that is not treated the same as the plural. (Look at Dwayne's comment for an interesting take on this point.)
If you're going to localize for a language such as Arabic or Hebrew, remember that these are written right to left, so your whole design (including pictures) will need to change orientation. In HTML, that's mostly as easy as setting the dir="rtl" attribute, but sometimes it can be tricky.
It's not just about translation. Things that will change include number formats (commas or periods as decimal and thousands separators), currency symbols coming before or after the amount, currency formatting, date formatting, etc. Make sure that all of these are formatted by the .NET Framework using the culture of the current user.
Be disciplined about not hardcoding any strings in your UI. A handy trick is to create a resource file for a language that doesn't use Latin characters (Chinese, Russian, Arabic, whatever) and fill all entries with random strings from Google in that language. Run your application, and you will easily spot the parts of the UI that are not coming from the resource file (they will be the English characters in the middle of the Chinese ones).
It is not just about the UI. If you are sending messages from the backend, like a response from a service, those also need to be localized. In some cases, even error messages written to the Event Log are required to be localized. Make sure you think about that.
JavaScript. If you're building an AJAX-heavy web application with a lot of JavaScript, you might need a library such as jQuery localization to help. You will have to serve your resource file in a JS key-value kind of structure. Since this is less standard than ASP.NET, it could require some improvisation on your side depending on your needs (decisions such as how to load these resource files, all at once or with AMD, or maybe creating a service that returns the localized strings, or just letting ASP.NET bind the values from the actual resource file at compile time, etc.).
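To make the pluralization point above more concrete, here is a minimal sketch of picking a resource entry by plural category. Everything in it is an assumption for illustration: the PluralResources class, the key naming scheme and the in-memory table are hypothetical (a real application would use .resx files), and the Arabic rules are only loosely modeled on the CLDR plural categories and should be checked against CLDR before use.

```csharp
using System;
using System.Collections.Generic;

public static class PluralResources
{
    // Hypothetical resource table; real entries would live in per-language .resx files.
    private static readonly Dictionary<string, string> Strings = new Dictionary<string, string>
    {
        { "en:WithinDistance.one",   "within {0} kilometer of this place" },
        { "en:WithinDistance.other", "within {0} kilometers of this place" },
    };

    // Simplified plural categories (assumption: loosely follows CLDR, not a complete rule set).
    public static string Category(string language, int count)
    {
        if (language == "ar")
        {
            if (count == 0) return "zero";
            if (count == 1) return "one";
            if (count == 2) return "two"; // the separate form for exactly two objects
            int lastTwoDigits = count % 100;
            if (lastTwoDigits >= 3 && lastTwoDigits <= 10) return "few";
            if (lastTwoDigits >= 11) return "many";
            return "other";
        }
        return count == 1 ? "one" : "other"; // English-like default
    }

    // Looks up the template for the plural category and fills in the number.
    public static string WithinDistance(string language, int count)
    {
        string key = language + ":WithinDistance." + Category(language, count);
        string template;
        if (!Strings.TryGetValue(key, out template))
            throw new KeyNotFoundException("No resource entry for " + key);
        return string.Format(template, count);
    }
}
```

The idea is that translators supply one whole sentence per category, so the code never glues an "s" onto a word.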

Related

Is there a better permalink solution [closed]

Closed 10 years ago.
I am developing a website in C# and ASP.NET MVC where people can manage their own web pages. At the moment I am using the permalink solution of StackOverflow but I am not sure if this will work in my situation because people will add and delete pages constantly. This means that the id in the pages table will grow very large.
Example: mydomain.com/page/17745288223/my-page-title
Is there a better solution?

I think that for your case (users creating pages) it's actually more user-friendly to put all pages created by a single user under his/her own path, i.e.:
mydomain.com/page/{username/nickname/some-name-selected-by-user}/my-page-title
If you don't want to use such a format, an int or long in the URL will probably do.

Well, you could use some kind of hash to make lookups more efficient. You could, for instance, compute a SHA-1 hash of the page title, creation date, user information, etc., just like git does for commit ids.
Or you could use simple numbers, but convert them into a more compact representation using hexadecimal or alphanumeric characters, like some URL-shortening services do.
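As a rough sketch of the compact-representation idea, the numeric id can be written in hexadecimal or in base 62 (digits plus upper- and lower-case letters). The IdCodec class and its alphabet are made up for this example and are not taken from any particular URL shortener.

```csharp
using System;
using System.Text;

public static class IdCodec
{
    // Assumed alphabet: 0-9, A-Z, a-z (62 symbols).
    private const string Alphabet =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Hexadecimal form of the id, using the built-in format specifier.
    public static string ToHex(long id)
    {
        return id.ToString("x");
    }

    // Base-62 form of a non-negative id.
    public static string Encode(long id)
    {
        if (id < 0) throw new ArgumentOutOfRangeException("id");
        if (id == 0) return "0";
        var sb = new StringBuilder();
        while (id > 0)
        {
            sb.Insert(0, Alphabet[(int)(id % 62)]);
            id /= 62;
        }
        return sb.ToString();
    }

    // Inverse of Encode; throws on characters outside the alphabet.
    public static long Decode(string text)
    {
        long id = 0;
        foreach (char c in text)
        {
            int digit = Alphabet.IndexOf(c);
            if (digit < 0) throw new FormatException("Invalid character: " + c);
            id = checked(id * 62 + digit);
        }
        return id;
    }
}
```

With this scheme the 11-digit id from the question, 17745288223, fits in 6 base-62 characters, and IdCodec.Decode turns the URL segment back into the table key.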

This started as a comment, but it was growing too long, so here it is as an answer.
The page id solution seems just fine.
What are you worried about? If you are expecting a few million pages, that's 7 characters. If you are expecting more than a few billion pages, that's 9 to 10 characters. Pretty manageable, I think.
You could also represent it as hex and reduce it to a maximum of 8 characters to fit up to 2^32 different ids.

This means that the id in the pages table will grow very large.
What's the problem with that?
The largest value for an int is also very large (just over 2 billion) so I doubt it will hit any limit unless you are planning to have millions of users with thousands of pages each.
If you are still worried then you can use a long (64-bit integer). It can handle trillions of users with millions of pages each. Note that the population of the Earth is only a few billion.

Why does generated code in C# use underscores? [closed]

Closed 10 years ago.
I know this might be a stupid question, but here it goes.
I have always written my private members like privateMember, and I've been reading a lot about naming conventions in C# because I noticed that a lot of the automatically generated code in Visual Studio uses _variableName for private members. Everything I read, even Microsoft's own documents, says that you should use privateMember.
So my question is: if good practice says that I should write privateMember, as I do now, why the heck does Visual Studio generate classes with private members using an underscore (_privateMember)?

Microsoft Code Conventions actually recommend against using underscores altogether. It is really personal preference. I would not use generated code as inspiration for my coding convention standard.
Do not use underscores, hyphens, or any other nonalphanumeric characters.
Maybe it's because it's generated code and not intended to be read by humans. ;-)

Not so long ago, when C# was coming onto the market, there was an idea that local variables should be prefixed with _. This idea was not accepted by the community, because in plain C a leading _ marks system variables/functions and a leading __ marks metadata. So after a few years the use of the prefix was discouraged. You will still find people who use this notation, not out of fanaticism but because a lot of old C# applications follow this convention.
Why is this in Visual Studio?
This might be related to the time at which it was designed. At that time this approach was suggested by the language designers, so probably no one has changed the configuration for the latest version.

Naming conventions aren't 100% agreed upon. This is one of those that some people like, some people are indifferent to, and some people hate. Certain people consider it better for instance variables to stand out, via their name, and this is one way to do that. Other people use this.instanceVariable rather than instanceVariable all of the time so that instance variables stand out, other people prepend something other than a '_' character, and some people just don't go out of their way to use any special distinction.
At the end of the day what's important is that you, and the other members of your team agree on a standard and are consistent with it. What the rest of the world chooses to do doesn't need to affect you.
It's also worth mentioning that the code snippets generated by Visual Studio, in most cases, can be configured to be in line with your team's coding practices.

It's just a convention they use, I do it too. You can ultimately name your private fields whatever you want. Prefixing it with an underscore just makes it easier to read IMO.

By convention, private fields were (and still are) often written with a leading underscore, e.g. string _name;
This link gives more information on Microsoft's naming guidelines: http://msdn.microsoft.com/en-us/library/ms229045.aspx

It's just a C# convention, so that in a constructor you can write _variable instead of this.variable when the constructor parameter and the field have the same name.
All the C# naming conventions are listed at
http://msdn.microsoft.com/en-us/library/ms229002.aspx
It's up to you whether you follow the convention of the generated code.
Despite the recommendations, many programmers use the same convention as the generated code.
Some tools that help you refactor code also suggest that naming convention for fields.
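A small side-by-side sketch of the two styles discussed here; the class names are invented for the comparison and neither style is being presented as the official one.

```csharp
// Style 1: underscore prefix, so the field and the parameter never clash.
public sealed class PersonWithUnderscore
{
    private readonly string _name;

    public PersonWithUnderscore(string name)
    {
        _name = name; // no "this." needed
    }

    public string Name { get { return _name; } }
}

// Style 2: no prefix, so "this." is required to disambiguate in the constructor.
public sealed class PersonWithThis
{
    private readonly string name;

    public PersonWithThis(string name)
    {
        this.name = name; // writing "name = name" would assign the parameter to itself
    }

    public string Name { get { return name; } }
}
```

Both classes behave identically; the difference is purely in how the field is told apart from the parameter.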

The underscore at the beginning is VS's way of showing that it is a privateMember. We keep the underscore at the beginning as a rule, but it is really a personal preference as to what naming convention you use. Just pick one and stick with it so you don't confuse yourself or anyone else that might look at your code.

Artificial Intelligence, Text Classifier [closed]

Closed 10 years ago.
I am new to AI. I am working on an application that does text classification via machine learning. The application needs to classify different parts of an HTML document. For example, most web pages have a head, menu, sidebar, footer, main content, etc. I want to use a text classifier to classify these parts of an HTML document, and to identify the different types of forms on the page.
It would be very helpful if anyone could provide detailed guidance on this subject.
Examples of similar applications would also be very helpful.
I am looking for more technical suggestions, relating to code and implementation.
I can assign labels to html tag attributes, like class or id
<div class="menu-1">
<div id="entry">
<div id="content">
<div id="footer">
<div id="comment-12">
<div id="comment-title">
like for first item:
TrainClassifier(label: "Menu", value: "menu-1", attribute: "class", position-in-string: "21%", tag: "div");
Inputs:
"menu-1" (attribute value)
"class" (attribute name)
"21" (tag position in string)
"div" (tag name)
Output
"Menu" (classified as label)
Which neural network library can take the above inputs and classify them into labels (i.e. Menu)?
Not all users can write regular expressions or XPath; they need an easier approach, so it is important to make the software intelligent. The user should be able to highlight the part of the HTML document he/she needs, using a WebBrowser control, and train the software until it can work on its own.
But I don't know how to train the software using AI.
The AI I am looking for should be able to accept various inputs and classify on the basis of them. As I have already said, I am new to AI and don't know much about it.
It would be helpful to get an answer to the question I have actually asked: which library I should use and how to implement it. Please don't answer with suggestions to use XPath, regex or other methods; it often happens that you get every suggestion but the one you need.

I suggest you look into simpler algorithms first, which are easy to understand. I can give pointers to some.
Naive Bayes (you will find many implementations, but you can write it yourself; the algorithm is simple to implement yet quite powerful).
Maximum Entropy (e.g. SharpMaxEnt, which is open source).
SVM (e.g. a C# port of LibSVM).
If you want to get a taste of how these work, download the WEKA toolkit:
http://sourceforge.net/projects/weka/
The commonly followed steps are:
Identify as many attributes/features as you can get (and a set of labels).
Collect data, which is a set of { Label, Attribute1, A2, A3, ... } records.
Select a minimal set of important attributes using feature selection algorithms (also available in the WEKA toolkit).
Train the classifier using a standard algorithm.
Test the system until you reach the desired accuracy, recall, or other metrics.
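To give an idea of how small a hand-written Naive Bayes can be, here is a sketch of a categorical classifier with add-one smoothing. It is not taken from SharpMaxEnt, LibSVM or WEKA; the NaiveBayesClassifier class and the choice of features (tag name, attribute name, attribute value with digits stripped) are assumptions based on the inputs listed in the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: categorical Naive Bayes over a fixed list of string features.
public sealed class NaiveBayesClassifier
{
    private readonly Dictionary<string, int> _labelCounts = new Dictionary<string, int>();
    private readonly Dictionary<string, int> _featureCounts = new Dictionary<string, int>();
    private readonly List<HashSet<string>> _valuesSeen = new List<HashSet<string>>();
    private int _totalExamples;

    private static string Key(string label, int index, string value)
    {
        return label + "\u0001" + index + "\u0001" + value;
    }

    // Records one labeled example, e.g. Train("Menu", "div", "class", "menu").
    public void Train(string label, params string[] features)
    {
        _totalExamples++;
        int labelCount;
        _labelCounts.TryGetValue(label, out labelCount);
        _labelCounts[label] = labelCount + 1;

        for (int i = 0; i < features.Length; i++)
        {
            while (_valuesSeen.Count <= i) _valuesSeen.Add(new HashSet<string>());
            _valuesSeen[i].Add(features[i]);

            string key = Key(label, i, features[i]);
            int count;
            _featureCounts.TryGetValue(key, out count);
            _featureCounts[key] = count + 1;
        }
    }

    // Returns the label with the highest log-probability, or null if untrained.
    public string Classify(params string[] features)
    {
        string best = null;
        double bestScore = double.NegativeInfinity;
        foreach (KeyValuePair<string, int> label in _labelCounts)
        {
            double score = Math.Log((double)label.Value / _totalExamples); // prior
            for (int i = 0; i < features.Length && i < _valuesSeen.Count; i++)
            {
                int count;
                _featureCounts.TryGetValue(Key(label.Key, i, features[i]), out count);
                // Add-one smoothing; the extra +1 reserves mass for unseen values.
                score += Math.Log((count + 1.0) / (label.Value + _valuesSeen[i].Count + 1.0));
            }
            if (score > bestScore) { bestScore = score; best = label.Key; }
        }
        return best;
    }
}
```

After training on a handful of examples such as ("Menu", "div", "class", "menu") and ("Footer", "div", "id", "footer"), calling Classify("div", "id", "footer") picks the label whose examples best match those feature values. Real pages would need far more training data and better features than this toy setup.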
Good Luck!

This is a very broad topic. There are a few neural network libraries out there for C#, just search for them on Stack Overflow.
You will need to perform supervised training before you can do any type of classification. In order for the ANN to understand what you are throwing at it, you will need to figure out how you will parse the HTML to get the results you are looking for.
As an example, most websites will use CSS to render content on a browser. Other sites may use tables. You will need to train for both.
Your problem is not an easy one.

Classification could help you, if you had pieces of data that you had to assign labels to. This is not the case. You would be better off manually writing out XPath rules for taking apart your documents.

Isn't number localization just unnecessary? [closed]

Closed 11 years ago.
I've just read this page http://weblogs.asp.net/scottgu/archive/2010/06/10/jquery-globalization-plugin-from-microsoft.aspx
One of the things they did was to convert the Arabic date to the Arabic calendar. I'm wondering if it is a good idea at all to do so. Will it actually be annoying or confusing for the user (even if the user is Arabic)?
Also, my second question: do we really need to change 3,899.99 to 3.899,99 for some cultures, like German? It doesn't hurt to do so, since the library already does it for us, but wouldn't this actually cause more confusion for the user (even if he is German)?
I'm sure that whatever culture a person comes from, if I give them the number 3,899.99 there's no way they'd get that wrong, right? (They have probably learned the universal format anyway.)

Your problem here seems to be a bad assumption. There is no "universal format" for numbers. 3,899.99 is valid in some places, and confusing in others. Same for the converse. People can often figure out what they need to (especially if it's in software that is clearly doing a shoddy job of localization otherwise. :) ), but that's not the point.
Except in certain scientific and technical domains that general software doesn't usually address, there's no universal format for any of these things. If you want your software to be accepted on native terms anywhere but your own place, you'll need to work for it.

To me it seems like it would be much less confusing to see dates and numbers in the format you're used to (in your country or language) - why do you think it would be the other way around?

The point of localization is to make your application look more natural to the user. It is definitely advisable to do this if your application is used internationally. While you can use US standards, that is not a very customer-friendly way of doing things.
How would it be more confusing to a person to see the format they are familiar with? Meet people where they are with your application. If their standard is 10.000,00 and you are showing them 10,000.00, even if they understand it, it does make it a bit disconcerting. Reverse the situation and think what you would like. Would you like a developer using 10.000,00 for their application because you can understand it just fine?
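For reference, producing either format in .NET is a one-liner once a culture is chosen. This is a minimal sketch; the NumberFormatDemo class name is made up, and the culture names are just the two examples from this thread.

```csharp
using System.Globalization;

public static class NumberFormatDemo
{
    // Formats a value with two decimals and group separators for the given culture.
    public static string Format(decimal value, string cultureName)
    {
        CultureInfo culture = CultureInfo.GetCultureInfo(cultureName);
        return value.ToString("N2", culture);
    }
}
```

NumberFormatDemo.Format(10000m, "en-US") gives 10,000.00 and NumberFormatDemo.Format(10000m, "de-DE") gives 10.000,00, so supporting the user's own format costs very little.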

Depends. 3.899,99 to me looks like two numbers. 3.899 and 99. I imagine our number formatting looks similarly funny to foreigners. Sure, I could guess what it means here, but what if you had a whole bunch of numbers like this clustered together? The winning lotto numbers are 45,26,21,56,94,13. Is that one big number, or 6 2-digit numbers?
Date formatting is especially important. 01/02/03. Is that Jan 2 2003, Feb 1 2003, Feb 3 2001 or what? Different cultures specify the d/m/y in different orders. Also, when spelled out, they obviously have different names for the months.
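The ambiguity is easy to reproduce in .NET by parsing the same text under two cultures. A minimal sketch, with a made-up DateParseDemo class; the ISO-style helper is an extra suggestion for data exchange, not something from the answer above.

```csharp
using System;
using System.Globalization;

public static class DateParseDemo
{
    // Parses the text using the date conventions of the given culture.
    public static DateTime Parse(string text, string cultureName)
    {
        return DateTime.Parse(text, CultureInfo.GetCultureInfo(cultureName));
    }

    // Culture-independent year-month-day form, handy for storage or exchange.
    public static string ToIsoDate(DateTime date)
    {
        return date.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
    }
}
```

Parsing "01/02/03" as en-US yields January 2, 2003, while parsing it as en-GB yields February 1, 2003. Keeping a culture-independent form internally and localizing only for display avoids that kind of surprise.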
If you have the time and resources to internationalize it, I think you should.

As a foreigner myself, I can assure you that localization helps a lot in terms of user satisfaction. Commas or dots in numbers may lead to big mistakes. Another one is the relative position of days and months.
To improve even further, create translations and add an option to choose the locale. That way you will have close to 100% customer satisfaction.

Another important thing is input. If you don't have localization and you take the user input "1.234", what does the user mean: 1.234 or 1234? There may be users who don't like their values being off by a factor of 1000... who knows? ;)
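The factor-of-1000 problem can be shown directly with decimal.Parse. This is only a sketch; the NumberInputDemo class name is invented, and the two cultures are examples.

```csharp
using System.Globalization;

public static class NumberInputDemo
{
    // Interprets user input according to the number format of the given culture.
    public static decimal Parse(string input, string cultureName)
    {
        CultureInfo culture = CultureInfo.GetCultureInfo(cultureName);
        return decimal.Parse(input, NumberStyles.Number, culture);
    }
}
```

With "1.234" as input, NumberInputDemo.Parse returns 1.234 for en-US (the dot is the decimal separator) and 1234 for de-DE (the dot is the thousands separator), so the parsing culture has to match what the user was shown.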

What are the best practices for handling Unicode strings in C#? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
Can somebody please provide me some important aspects I should be aware of while handling Unicode strings in C#?

Keep in mind that C# strings are sequences of Char values, which are UTF-16 code units. They are not Unicode code points. Some Unicode code points require two Chars (a surrogate pair), and you should not split strings between these Chars.
In addition, Unicode code points may combine to form a single language 'character', for instance a 'u' Char followed by a combining umlaut Char. So you can't split strings between arbitrary code points either.
Basically, it's a mess of issues, where any given issue may in practice only affect languages you don't know.
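One way to avoid cutting in the wrong place is to work in text elements via System.Globalization.StringInfo instead of Char indexes. A minimal sketch follows; the TextElements helper class is hypothetical, and the exact grouping rules StringInfo applies vary somewhat between .NET versions.

```csharp
using System.Globalization;

public static class TextElements
{
    // Number of user-perceived characters (text elements), not UTF-16 code units.
    public static int Count(string text)
    {
        return new StringInfo(text).LengthInTextElements;
    }

    // Keeps at most maxElements text elements without splitting a surrogate pair
    // or separating a base character from its combining marks.
    public static string Truncate(string text, int maxElements)
    {
        if (maxElements <= 0 || text.Length == 0) return string.Empty;
        var info = new StringInfo(text);
        return info.LengthInTextElements <= maxElements
            ? text
            : info.SubstringByTextElements(0, maxElements);
    }
}
```

For example, the musical G clef "\U0001D11E" has Length 2 but counts as one text element, and "u\u0308" (u plus combining diaeresis) also has Length 2 and counts as one.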

C# (and .NET in general) handles Unicode strings transparently, and you won't have to do anything special unless your application needs to read/write files with specific encodings. In those cases, you can convert managed strings to byte arrays in the encoding of your choice by using the System.Text.Encoding class and its subclasses in the System.Text namespace.

System.String already handles Unicode internally, so you are covered there. Best practice would be to use System.Text.Encoding.UTF8 when reading and writing files. It's about more than just reading/writing files, however: anything that streams data out, including network connections, is going to depend on the encoding. If you're using WCF, it's going to default to UTF-8 for most of the bindings (in fact most don't allow ASCII at all).
UTF-8 is a good choice because, while it still supports the entire Unicode character set, it encodes the ASCII characters as exactly the same bytes as ASCII does. Thus naive applications that don't support Unicode have some chance of reading/writing your application's data. Those applications will only begin to fail when you start using extended characters.
System.Text.Encoding.Unicode will write UTF-16 which is a minimum of two bytes per character, making it both larger and fully incompatible with ASCII. And System.Text.Encoding.UTF32 as you can guess is larger still. I'm not sure of the real-world use case of UTF-16 and 32, but perhaps they perform better when you have large numbers of extended characters. That's just a theory, but if it is true, then Japanese/Chinese developers making a product that will be used primarily in those languages might find UTF-16/32 a better choice.
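The size question can be checked rather than guessed, using Encoding.GetByteCount. A small sketch (the EncodingSizes class is made up; the counts exclude any byte order mark):

```csharp
using System.Text;

public static class EncodingSizes
{
    // Returns the encoded size in bytes as { UTF-8, UTF-16, UTF-32 }.
    public static int[] Measure(string text)
    {
        return new[]
        {
            Encoding.UTF8.GetByteCount(text),
            Encoding.Unicode.GetByteCount(text), // "Unicode" means UTF-16 little-endian in .NET
            Encoding.UTF32.GetByteCount(text),
        };
    }
}
```

For "abc" this gives 3, 6 and 12 bytes. For the three-character Japanese string "日本語" it gives 9, 6 and 12 bytes, so UTF-16 is indeed more compact than UTF-8 for that kind of text, while UTF-32 is the largest in both cases.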

Only think about encoding when reading and writing streams. Use TextReader and TextWriter implementations to read and write text in different encodings. Always use UTF-8 if you have a choice.
Don't get confused by languages and cultures; that's a completely separate issue from Unicode.
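As an illustration of the TextReader/TextWriter advice, the sketch below writes text through a StreamWriter and reads it back through a StreamReader with an explicit encoding. The StreamEncodingDemo class is hypothetical, and an in-memory stream stands in for a file or network stream.

```csharp
using System.IO;
using System.Text;

public static class StreamEncodingDemo
{
    // Encodes text to bytes through a TextWriter using the given encoding.
    public static byte[] Write(string text, Encoding encoding)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, encoding))
            {
                writer.Write(text);
            } // disposing the writer flushes it into the stream
            return stream.ToArray();
        }
    }

    // Decodes bytes back to text through a TextReader using the given encoding.
    public static string Read(byte[] bytes, Encoding encoding)
    {
        using (var reader = new StreamReader(new MemoryStream(bytes), encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Passing new UTF8Encoding(false) to both methods round-trips any string without emitting a byte order mark; reading with a different encoding than the one used for writing is where garbled text comes from.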

.Net has relatively good i18n support. You don't really need to think about unicode that much as all .Net strings and built-in string functions do the right thing with unicode. The only thing to bear in mind is that most of the string functions, for example DateTime.ToString(), use by default the thread's culture which by default is the Windows culture. You can specify a different culture for formatting either on the current thread or on each method call.
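Both options mentioned above, setting the thread's culture or passing a culture per call, can be sketched as follows. The CultureScopeDemo class is made up for this example.

```csharp
using System;
using System.Globalization;
using System.Threading;

public static class CultureScopeDemo
{
    // Option 1: change the current thread's culture, format, then restore it.
    public static string FormatWithThreadCulture(DateTime value, string cultureName)
    {
        CultureInfo previous = Thread.CurrentThread.CurrentCulture;
        try
        {
            Thread.CurrentThread.CurrentCulture = CultureInfo.GetCultureInfo(cultureName);
            return value.ToString("d"); // picks up the thread's culture implicitly
        }
        finally
        {
            Thread.CurrentThread.CurrentCulture = previous;
        }
    }

    // Option 2: pass the culture explicitly on the call.
    public static string FormatPerCall(DateTime value, string cultureName)
    {
        return value.ToString("d", CultureInfo.GetCultureInfo(cultureName));
    }
}
```

Both methods return the same short date for a given culture. For machine-readable output such as logs or file formats, passing CultureInfo.InvariantCulture explicitly keeps the result independent of whichever culture the thread happens to have.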
The only time unicode is an issue is when encoding/decoding strings to and from bytes.

As mentioned, .NET strings handle Unicode transparently. Besides file I/O, the other consideration would be the database layer. SQL Server, for instance, distinguishes between VARCHAR (non-Unicode) and NVARCHAR (which handles Unicode). You also need to pay attention to the types of stored procedure parameters.

More details can be found on this thread:
http://discuss.joelonsoftware.com/default.asp?dotnet.12.189999.12
