Understanding string sizes / unicode / sqlServer column sizes

I need to better understand how user text input works in Windows / ASP.NET applications, and what the appropriate column sizes are for storing that data in SQL Server.
If everything were ASCII, this would seem simple (1 byte per character), but I don't really understand what happens when a user types text into an input field. If the input is Unicode, there are (generally) 2 bytes per character (?), so if I know a text input cannot be any longer than 5 characters, should the SQL column be varchar(10)? How do I know whether an input should be ANSI or Unicode?
Hopefully this makes sense. I have never fully understood how a web page or a Windows app determines how the data is encoded.

When you create a column, you specify the number of characters you need to store, regardless of whether it is Unicode or not. Need up to 5 characters? Then it's either VARCHAR(5) or NVARCHAR(5), depending on whether you actually need Unicode or not; that's a business decision, not a technical one. The 2 bytes per character have nothing to do with the column definition; they are about storage size. So a VARCHAR(5) will take 5 bytes if fully populated, and an NVARCHAR(5) will take 10 bytes if fully populated. You don't have to worry about those implementation details when defining the column. However, you should be sure that Unicode is required before making the choice, because doubling the space requirement for no reason is wasteful.
(Ignoring arguments about whether such a column should be CHAR/NCHAR, null byte overhead, etc.)
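The 5-byte versus 10-byte figures above can be checked with a small sketch. It is written in Python purely for illustration and assumes a single-byte code page (Windows-1252 is an arbitrary choice) for the VARCHAR case and UTF-16 little-endian for the NVARCHAR case; the variable names are hypothetical.

```python
# Storage cost of a fully populated 5-character value.
text = "hello"  # 5 characters

# Assumption: VARCHAR with a single-byte code page (cp1252 here): 1 byte per character.
varchar_bytes = text.encode("cp1252")

# Assumption: NVARCHAR stored as UTF-16 little-endian: 2 bytes per character for text like this.
nvarchar_bytes = text.encode("utf-16-le")

print(len(text), len(varchar_bytes), len(nvarchar_bytes))
```

The column is declared with the character count (5) in both cases; only the bytes on disk differ.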

SQL column sizes are specified in characters, not bytes. A varchar(10) column will accept 10 characters. If you are going to accept Unicode input, it is best to use nvarchar(10): it still holds 10 characters, but it allows any Unicode input in that column. The same goes for ntext vs. text and nchar vs. char. A good MSDN page for understanding SQL data types is http://technet.microsoft.com/en-us/library/ms187752.aspx. As for what goes into your text boxes on the ASP.NET site: anything can be entered into a text box, and it is up to you to enforce, in code, the rules for what input you accept.
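One way to enforce such a rule in code is to test whether the input survives conversion to the non-Unicode code page before writing it to a varchar column. The following is a minimal sketch in Python, not ASP.NET code; the helper name `fits_code_page` and the choice of Windows-1252 are assumptions made for the example.

```python
def fits_code_page(value: str, code_page: str = "cp1252") -> bool:
    """Return True if every character of value exists in the given code page."""
    try:
        value.encode(code_page)
        return True
    except UnicodeEncodeError:
        # At least one character has no representation in this code page,
        # so a non-Unicode column using it could not store the value faithfully.
        return False


print(fits_code_page("hello"))  # plain ASCII
print(fits_code_page("café"))   # accented Latin letter
print(fits_code_page("شکمی"))   # Farsi text
```

Input that fails the check either needs a Unicode column (nvarchar) or has to be rejected at the UI.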

Related

Storing 16 bytes of String array in 4 bytes memory, (compression) in RFID Tags

I hope this question is not too vague. I am working on an RFID project using passive tags. These tags store only 4 bytes (32 bits) of data. I am trying to store more information, as a string, in the tag's data bank. I searched the internet for string compression algorithms but did not find any that were suitable. Could someone please guide me through this issue? How can I save more data in this 4-byte data bank? Should I use some other storage strategy, and if so, which? I am using C# on a handheld Windows CE device.
I would appreciate any help.

It depends on your tag. For example, the Alien tag (http://www.alientechnology.com/docs/products/Alien-Technology-Higgs-3-ALN-9662-Short.pdf) has EPC memory. I think you are using the EPC memory, but you can also use the User Memory on your tag. You don't have to compress anything; just use the User Memory. That said, I prefer not to store much data on the tag itself: I use my own 32-bit code and map it to the full data in my software, which is stored on my hard disk. It is safer too.

There is obviously no compression that can reduce arbitrary 16-byte values to 4-byte values. That's mathematically impossible; see the pigeonhole principle for details.
Store the actual data in some kind of database, and have the 4 bytes encode an integer that acts as a key for the row you want to refer to, for example an auto-increment primary key or an index into an array. This works with up to about 4 billion rows.
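A minimal sketch of the key-on-the-tag idea, in Python for illustration rather than C# on Windows CE. The `records` dictionary stands in for the database table, and the function names and the big-endian byte order are assumptions for the example.

```python
import struct

# Hypothetical stand-in for the database table: integer key -> full record.
records = {
    1: "Pallet 17, warehouse B, received 2012-03-01",
    2: "Pallet 18, warehouse B, received 2012-03-02",
}


def encode_key(row_id: int) -> bytes:
    """Pack a row id into exactly 4 bytes (unsigned 32-bit, big-endian)."""
    return struct.pack(">I", row_id)


def decode_key(tag_data: bytes) -> int:
    """Unpack the 4 bytes read from the tag back into the row id."""
    (row_id,) = struct.unpack(">I", tag_data)
    return row_id


tag_data = encode_key(2)              # what would be written to the tag
print(len(tag_data))                  # size of the payload in bytes
print(records[decode_key(tag_data)])  # full record looked up off the tag
```

The tag only ever carries the key; the record itself can grow or change without rewriting the tag.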

If you have fewer than 2^32 strings, simply enumerate them and save each string's index (in your "dictionary") in your 4-byte "Data Bank".

A compression scheme can't guarantee such high compression ratios.
The only way I can think of with 32 bits is to store an int in them and construct a local/remote URL out of it that points to the actual data.
You could also make the stored value point to entries in a local look-up table on the device.

Unless you know a lot about the format of your string, it is impossible to do this. This is evident from the pigeonhole principle: you have a theoretical 2^128 different 16-byte strings, but only 2^32 different values to choose from.
In other words, no compression algorithm will guarantee that an arbitrary string in your possible input set will map to a 4-byte value in the output set.
It may be possible to devise an algorithm which will work in your particular case, but unless your data set is sufficiently restricted (at most 1 in 79,228,162,514,264,337,593,543,950,336 possible strings may be valid) and has a meaningful structure, then your only option is to store some mapping externally.
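The "1 in 79,228,162,514,264,337,593,543,950,336" figure is simply the number of possible 16-byte inputs divided by the number of possible 4-byte outputs. A quick check of that arithmetic, with illustrative variable names:

```python
possible_inputs = 2 ** 128   # distinct 16-byte strings
possible_outputs = 2 ** 32   # distinct 4-byte values

# Average number of inputs that would have to share each output value.
ratio = possible_inputs // possible_outputs

print(ratio == 2 ** 96)
print(f"{ratio:,}")
```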

What data type should I use to store text data?

I am building a profile system. The database has a field named AboutMe whose data type is text.
This field may contain up to 30,000 characters. The problem is that when I store more than 4,000 characters (27,000, say), they are not all shown in the UI; instead the content is truncated and only some of the characters are shown.
Whenever I use 4,000 or more characters, the UI shows fewer than 4,000 characters.
I am using a SQL Server 2008 R2 database.

As of SQL Server 2005, you should use VARCHAR(MAX) for non-Unicode text, or NVARCHAR(MAX) for Unicode text (using up 2 bytes per character). TEXT and NTEXT have been deprecated and should not be used anymore.
Those are the current datatypes, and they can be treated just like any other text / string column. All the string functions work on them just fine.
The maximum capacity of each of those columns is 2 GB of data: about 2 billion non-Unicode characters or 1 billion Unicode characters.
Considering that a really long book like Tolstoy's War and Peace is probably 5 million characters or less (560,000 words), this would be enough space to store that book at least 200 times in Unicode, which should be plenty for most applications.
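The capacity estimate can be reproduced with a few lines of arithmetic. This sketch assumes the limit is 2^31 - 1 bytes, 1 byte per non-Unicode character and 2 bytes per Unicode character, and takes the answer's 5 million characters as the size of the book.

```python
MAX_BYTES = 2 ** 31 - 1          # assumed size limit of a (N)VARCHAR(MAX) value, in bytes

varchar_chars = MAX_BYTES        # 1 byte per character
nvarchar_chars = MAX_BYTES // 2  # 2 bytes per character

book_chars = 5_000_000           # upper estimate for War and Peace, from the answer
copies = nvarchar_chars // book_chars

print(varchar_chars, nvarchar_chars, copies)
```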

The text data type is deprecated, so you should use varchar(max) instead.
If by UI you mean SQL Server Management Studio, it's correct that it won't show large text values. The editor has some limitations like that, for performance reasons.
When you access the data programmatically there is no such limitation. You should, however, be aware that large text values are sent in a separate data stream, so if you have more than one large text value per record, you have to access them in the order in which you select them.

You can use ntext datatype for that.

String gets mysteriously cut off

In my application I use WpfLocalization to provide translations while the application is running. The library will basically maintain a list of properties and their assigned localization keywords and use DependencyObject.SetValue() to update their values when the active language is changed.
The scenario in which I noticed my problem is this: I have a simple TextBlock and have assigned a localization keyword for its Text property. Now when my application starts, it will write the initial value into it and it will display just fine on screen. Now I switch the language and the new value is set as the Text property but only half the text will actually display on screen. Switching the languages back and forth does not have any effect. The first language is always displayed fine, the second is cut off (in the middle of words, but always full characters).
The relative length of the two languages' strings does not seem to have anything to do with it. In my test case the string in the working language is 498 bytes, while the one that gets cut off is 439 bytes and is cut off after 257 bytes.
When I inspect the current value of the Text property of said TextBlock right before I change its value through the localization code, it will always have the expected value (not cut off) in either language.
When inspecting the TextBlock at runtime through WPF Inspector it will display the cut off text as the Text property in the second language.
This makes no sense to me at all thus far. But now it gets better.
The original WpfLocalization library reads the localized strings from standard resource files, but we use a modified version that can also read those strings from an Excel file. It does that by opening an OleDbConnection using the Microsoft OLE DB driver and reading the strings through that. In the debugger I can see that all the values are read just fine.
Now I was really surprised when a colleague found the fix for the "cut off text" issue. He re-ordered the rows in the Excel sheet. I don't see how that could be relevant, but switching between the two versions of that file has an impact on the issue.

That does actually make sense. The OLE DB driver for Excel has to sample the data in a column to assign it a type and, in the case of strings, also a length. If it only samples values at or below the 255-character threshold, you get a string(255) type and truncated text; if it has sampled a longer string, it treats the column as a memo column and allows longer strings to be retrieved and stored. By re-ordering the rows, you change which rows are sampled.
If you read the documentation on moving data between SQL Server and Excel via OLE DB, you will find that this is a known issue: http://msdn.microsoft.com/en-us/library/ms141683.aspx. Since you are using the same OLE DB driver, I would expect the situation to apply to you as well.
From the docs:
Truncated text. When the driver determines that an Excel column contains text data, the driver selects the data type (string or memo) based on the longest value that it samples. If the driver does not discover any values longer than 255 characters in the rows that it samples, it treats the column as a 255-character string column instead of a memo column. Therefore, values longer than 255 characters may be truncated. To import data from a memo column without truncation, you must make sure that the memo column in at least one of the sampled rows contains a value longer than 255 characters, or you must increase the number of rows sampled by the driver to include such a row. You can increase the number of rows sampled by increasing the value of TypeGuessRows under the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel registry key. For more information, see PRB: Transfer of Data from Jet 4.0 OLEDB Source Fails w/ Error.
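The sampling behaviour described in the quote can be imitated with a toy model. This is not the driver's code: the function `read_column`, its `sample_rows` parameter, and the test data are hypothetical, and the sketch only shows why row order can decide whether long strings are truncated.

```python
def read_column(values, sample_rows):
    """Toy model: guess the column type from the first sample_rows values only."""
    sample = values[:sample_rows]
    is_memo = any(len(v) > 255 for v in sample)
    if is_memo:
        return list(values)            # memo column: values come back whole
    return [v[:255] for v in values]   # string(255) column: long values are cut


short = "a" * 100
long_text = "b" * 439

# Long value outside the sampled rows: it is truncated to 255 characters.
late = read_column([short] * 8 + [long_text], sample_rows=8)
print(len(late[-1]))

# Same data re-ordered so the long value is sampled: it survives intact.
early = read_column([long_text] + [short] * 8, sample_rows=8)
print(len(early[0]))
```

Re-ordering the sheet, as the colleague did, corresponds to the second call; raising TypeGuessRows corresponds to a larger `sample_rows`.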

decrypt an encrypted value?

I have an old Paradox database (I can convert it to Access 2007) which contains more than 200,000 records. This database has two columns: the first one is named "Word" and the second one is named "Mean". It is a dictionary database and my client wants to convert this old database to ASP.NET and SQL.
However, we don't know what key or method is used to encrypt or encode the "Mean" column which is in the Unicode format. The software itself has been written in Delphi 7 and we don't have the source code. My client only knows the credentials for logging in to database. The problem is decoding the Mean column.
What I do have is the compiled windows application and the Paradox database. This software can decode the "Mean" column for each "Word" so the method and/or key is in its own compiled code(.exe) or one of the files in its directory.
For example, we know that in the following row "Zymurgy" means exactly "مبحث عمل تخمیر در شیمی علمی, تخمیر شناسی", since the application translates it like that. Here is what the record looks like when I open the database in Access:
Word Mean
Zymurgy 5OBnGguKPdDAd7L2lnvd9Lnf1mdd2zDBQRxngsCuirK5h91sVmy0kpRcue/+ql9ORmP99Mn/QZ4=
Therefore we're trying to discover how the value in the Mean column is converted to "مبحث عمل تخمیر در شیمی علمی, تخمیر شناسی". I think the "Mean" column value in above row is encoded in Base64 string format, but decoding the Base64 string does not yet result in the expected text.
The extensions for files in the win app directory are dll, CCC, DAT, exe (other than the main app file), SYS, FAM, MB, PX, TV, VAL.
Any kind of help is appreciated.
Here are two more examples (the double quotes at the start and end are not part of the strings):
word: "abdominal"
coded value: "vwtj0bmj7jdF9SS8sbrIalBoKMDvTbpraFgG4gP/G9GLx5iU/E98rQ=="
translation in Farsi: "شکمی, بطنی, وریدهای شکمی, ماهیان بطنی"
word: "cart"
coded value: "KHoCkDsIndb6OKjxVxsh+Ti+iA/ZqP9sz28e4/cQzMyLI+ToPbiLOaECWQ8XKXTz"
translation in Farsi: "ارابه, گاری, دوچرخه, چرخ, با گاری بردن"
Here is the result of interpreting the Base64-decoded bytes in different encodings:
1- in unicode the result is: "ᩧ訋퀽矀箖�柖�섰᱁艧껀늊螹泝汖銴岔꫾也捆￉鹁"
2- in utf32 the result is: "��������������"
3- in utf7 the result is: "äàg\v=ÐÀw²ö{Ýô¹ßÖg]Û0ÁAgÀ®²¹ÝlVl´\\¹ïþª_NFcýôÉÿA"
4- in utf8 the result is: "��g\v�=��w���{����g]�0�Ag��������lVl���\\����_NFc����A�"
5- in 1256 the result is: "نàg\vٹ=ذہw²ِ–{فô¹كضg]غ0ءAg‚ہ®ٹ²¹‡فlVl´’”\\¹ï‏ھ_NFc‎ôةےA"
I have since discovered that the Paradox database system is very complex when it comes to key management; most of the time the keys are "compound keys", which is why it is problematic and why it has been abandoned.
UPDATE: I am trying to automate the extraction using AutoIt v3, because as I understand it the decryption cannot be worked out in a day or two. Now I have another problem, related to text and fonts. When I copy the translated text to Notepad, it turns into unrecognizable text unless I change Notepad's font to the font used by the translation software. If I type something in Farsi in Notepad, it is shown correctly regardless of which font I have chosen. More interesting still, when I copy the text to any other program, such as MS Office Word, it is shown correctly no matter which font I choose.
So how can I get around this?

In this situation, I would think about writing a script/program to simply pull all the data out through the existing program.
You could write an application to send keypresses to the app which would select and copy each value in turn.
It would take a while to run, but you could just leave it overnight (how big is your database?) and it only has to run once.
Not sure how easy this would be, since I haven't seen this app of course - might this work?

Take a debugger like OllyDbg or SoftICE. Find the place where the "Mean" value is decoded/encoded, then step through the instructions one by one and check all registers to find out what is done. I have done so numerous times. That should help you get started, since you have the application that is able to decode this data. You also have a reference word. That's all you need.
Also take into consideration that Unicode (UTF-16) can be little-endian or big-endian, so you might try swapping the bytes. UTF-8 can be a pain, since some characters are stored as one byte and others as two or more bytes.
You can also try to take words which are almost identical in Farsi and try to compare the outputs. That could lead to a reconstruction of a custom code page, if there is one.
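The encoding experiments from the question, together with the byte-order suggestion above, can be repeated in a few lines. This is a sketch in Python rather than the original tooling; it uses the "Zymurgy" value from the question, and the list of codecs tried is just an illustrative selection.

```python
import base64

# "Mean" value for the word "Zymurgy", copied from the question.
encoded = "5OBnGguKPdDAd7L2lnvd9Lnf1mdd2zDBQRxngsCuirK5h91sVmy0kpRcue/+ql9ORmP99Mn/QZ4="

raw = base64.b64decode(encoded)
print(len(raw), raw[:8].hex())

# Interpret the same bytes under several encodings, including both UTF-16 byte orders.
# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
for codec in ("utf-16-le", "utf-16-be", "utf-8", "cp1256"):
    print(codec, repr(raw.decode(codec, errors="replace")))
```

If none of the interpretations yields readable Farsi, the bytes are most likely transformed by something more than a text encoding, which is where the debugger approach comes in.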

ESE column type to XmlSerialize arbitrary objects

What's the best ESE column type to XmlSerialize an object to my ESE DB?
Both "long binary" and "long ASCII text" work OK.
Reason for long binary: absolutely sure there's no character conversion.
Reason for long text: the XML is text.
It seems MSDN says the 2 types only differ when sorting and searching. Obviously I'm not going to create any indices over that column; fields that need to be searchable and/or sortable are stored in separate columns of appropriate types.
Is it safe to assume any UTF8 text, less than 2GB in size, can be saved to and loaded from the ESE "long ASCII text" column value?

Yes, you can put up to 2GB of UTF8 text into any long text/binary column. The only difference between long binary and long text is the way the data is normalized when creating an index over the column. Other than that, ESE simply stores the provided bytes in the column with no conversion. ESE can only index ASCII or UTF16 data, and it is the application's responsibility to make sure the data is in the correct format, so it would seem more correct to put the data into a long binary column. As you aren't creating an index, there won't actually be any difference.
If you are running on Windows 7 or Windows Server 2008 R2 you should investigate column compression. For XML data you might get significant savings simply by turning compression on.
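Because the column stores the bytes as given, the round trip comes down to serializing to UTF-8 bytes and parsing the same bytes back. The sketch below shows that in Python with the standard library XML module instead of .NET's XmlSerializer; the `stored` variable merely stands in for the ESE column, and the element and attribute names are made up.

```python
import xml.etree.ElementTree as ET

# Build a small document containing non-ASCII text.
root = ET.Element("entry")
root.set("word", "cart")
root.text = "ارابه, گاری"

# Serialize to UTF-8 bytes; this is what would be written to the column.
xml_bytes = ET.tostring(root, encoding="utf-8")

# Stand-in for the column: the bytes are stored and returned unchanged.
stored = bytes(xml_bytes)

# Parse the bytes back and confirm nothing was lost.
restored = ET.fromstring(stored)
print(restored.get("word"), restored.text == root.text)
```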
