String gets mysteriously cut off - c#

In my application I use WpfLocalization to provide translations while the application is running. The library will basically maintain a list of properties and their assigned localization keywords and use DependencyObject.SetValue() to update their values when the active language is changed.
The scenario in which I noticed my problem is this: I have a simple TextBlock and have assigned a localization keyword for its Text property. Now when my application starts, it will write the initial value into it and it will display just fine on screen. Now I switch the language and the new value is set as the Text property but only half the text will actually display on screen. Switching the languages back and forth does not have any effect. The first language is always displayed fine, the second is cut off (in the middle of words, but always full characters).
The relative length of the two languages to each other does not seem to have anything to do with it. In my test case the working language string is 498 bytes, while the one that gets cut off is 439 bytes and is truncated after 257 bytes.
When I inspect the current value of the Text property of said TextBlock right before I change its value through the localization code, it will always have the expected value (not cut off) in either language.
When I inspect the TextBlock at runtime with WPF Inspector, it shows the cut-off text as the value of the Text property in the second language.
This makes no sense to me at all thus far. But now it gets better.
The original WpfLocalization library reads the localized strings from standard resource files, but we use a modified version that can also read those strings from an Excel file. It does that by opening an OleDbConnection using the Microsoft OLE DB driver and reading the strings through that. In the debugger I can see that all the values are read just fine.
Now I was really surprised when a colleague found the fix for the "cut off text" issue. He re-ordered the rows in the Excel sheet. I don't see how that could be relevant, but switching between the two versions of that file has an impact on the issue.

That actually makes sense: the OLE DB driver for Excel has to sample the data in a column to assign it a type and, in the case of strings, a length. If it only samples values below the 255-character threshold, you get a string(255) type and truncated text; if it has sampled a longer value, it assigns the column the memo type, which allows longer strings to be retrieved/stored. By re-ordering the rows, you are changing which rows get sampled.
If you read about importing Excel data into SQL Server via OLE DB you will find this is a known issue: http://msdn.microsoft.com/en-us/library/ms141683.aspx - since you are using the same OLE DB driver, I would expect the same behaviour to apply to you.
From the docs:
Truncated text. When the driver determines that an Excel column contains text data, the driver selects the data type (string or memo) based on the longest value that it samples. If the driver does not discover any values longer than 255 characters in the rows that it samples, it treats the column as a 255-character string column instead of a memo column. Therefore, values longer than 255 characters may be truncated. To import data from a memo column without truncation, you must make sure that the memo column in at least one of the sampled rows contains a value longer than 255 characters, or you must increase the number of rows sampled by the driver to include such a row. You can increase the number of rows sampled by increasing the value of TypeGuessRows under the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel registry key. For more information, see PRB: Transfer of Data from Jet 4.0 OLEDB Source Fails w/ Error.
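To make that concrete, here is a minimal sketch of reading such strings through the Jet OLE DB provider (the file name, sheet name and column names are made up for illustration); whether long values arrive intact or truncated depends entirely on which rows the driver happened to sample when it guessed the column type, as the quoted documentation explains.

using System;
using System.Data.OleDb;

class ExcelReadSketch
{
    static void Main()
    {
        // Hypothetical workbook layout: a "Strings$" sheet with Key and Text columns.
        var connectionString =
            @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=Translations.xls;" +
            @"Extended Properties=""Excel 8.0;HDR=YES""";

        using (var connection = new OleDbConnection(connectionString))
        using (var command = new OleDbCommand("SELECT [Key], [Text] FROM [Strings$]", connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // If the driver sampled only short values and typed [Text] as a
                    // 255-character string column, anything longer arrives truncated here.
                    string key = reader.GetString(0);
                    string text = reader.GetString(1);
                    Console.WriteLine("{0}: {1} chars", key, text.Length);
                }
            }
        }
    }
}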

ExcelDataReader.AsDataSet() converts single fraction double value into multiple fractions

I'm facing a problem when reading Excel sheet data using ExcelDataReader in C#.
I am reading data from an Excel sheet (.xlsm).
One of the cells has a list of values to choose from.
Eg.
5.1
5.2
5.1a
When I choose either 5.2 or 5.1a and read it, I get exactly the same value in the dataset.
But when I choose 5.1 and read it, I get 5.0999999999999996 in the dataset.
Here is the code I used to read the data in C#:
IExcelDataReader excelReader = ExcelReaderFactory.CreateOpenXmlReader(fileStream);
DataSet findingsData = excelReader.AsDataSet();
Note:
As a workaround, I put a space after the value 5.1 in the cell. Then it reads the value exactly as expected (5.1 instead of 5.0999999999999996).
But I'm wondering: when it reads the value 5.2 exactly without any space, why doesn't that work for 5.1?
Any suggestions are welcome to resolve this issue.
Thanks,
Karthik
Take a look at this question: Why can't decimal numbers be represented exactly in binary?
My maths isn't quite up to figuring it out precisely (comments welcome), but I suspect that 5.1 doesn't convert to a C# double precisely, while 5.2 happens to round-trip cleanly.
The reason it works when you add the space is that Excel then treats the field as text, the same way it treats 5.1a, whereas anything that looks like a number is assumed to be a number. (You can see this behaviour in a default blank spreadsheet: a value is right-aligned if it is a number and left-aligned when you add a space or any other text.)
I expect that if you explicitly format all the cells as text in your source spreadsheet, the values will be read as you expect.
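A minimal console sketch of what is likely going on: 5.1 has no exact binary double representation, so its full 17-digit round-trip form is the 5.0999999999999996 you are seeing, and rounding (or formatting the cells as text, as suggested above) restores the expected value.

using System;

class DoubleRoundTrip
{
    static void Main()
    {
        double fromSheet = 5.1;

        // The shortest form round-trips as "5.1", but the full 17-digit form
        // exposes the nearest representable double.
        Console.WriteLine(fromSheet.ToString("G17"));     // 5.0999999999999996
        Console.WriteLine(Math.Round(fromSheet, 2));      // 5.1
        Console.WriteLine(Convert.ToDecimal(fromSheet));  // 5.1
    }
}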

Data truncated after 255 bytes while using Microsoft.Ace.Oledb.12.0 provider

I am reading an Excel sheet using the ACE provider and certain cells contain data greater than 255 bytes. I tried changing TypeGuessRows in the registry as well as setting it from the connection string, but I still get the truncated value in my code. I am not in a position to restructure the Excel sheet or use another provider. I run 64-bit Windows and my Office edition is 2013 (I have a slight suspicion this might be the cause).
This is my connection string; it works fine for cells with data < 255 bytes.
var connectionString = string.Format("provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + fileName + ";Extended Properties=\"Excel 12.0;IMEX=1;HDR=YES;TypeGuessRows=0;ImportMixedTypes=Text\"");
Any solutions? Thanks in advance.
I am also using Microsoft.ACE.OLEDB.12.0 on 64-bit Windows 7.
I found that setting TypeGuessRows in the connection string has no effect.
But increasing the TypeGuessRows in the following registry location works:
HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Office\12.0\Access Connectivity Engine\Engines\Excel
More info on a similar bug (although you may already know this as you're already trying to change TypeGuessRows)
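If you prefer, that registry change can also be applied from code. This is only a sketch; it assumes the ACE 12.0 path quoted above (adjust the version segment for your Office installation), administrative rights, and that a value of 0 makes the driver sample all rows before guessing a type.

using Microsoft.Win32;

class TypeGuessRowsFix
{
    static void Main()
    {
        // Path as quoted above for the ACE 12.0 provider under 64-bit Windows;
        // adjust "12.0" (e.g. 14.0, 15.0) to match the installed Office edition.
        const string keyPath =
            @"SOFTWARE\Wow6432Node\Microsoft\Office\12.0\Access Connectivity Engine\Engines\Excel";

        using (var key = Registry.LocalMachine.OpenSubKey(keyPath, writable: true))
        {
            // 0 = sample every row (up to 16,384) instead of the default 8.
            key?.SetValue("TypeGuessRows", 0, RegistryValueKind.DWord);
        }
    }
}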
The solution to this was extremely simple.
Just change the format of the column containing this huge data from "General" to "Text" in the Excel sheet.
Now I feel like a n00b.
Refer to this link; I think this is the problem (try with Memo fields):
http://allenbrowne.com/ser-63.html
In Access tables, Text fields are limited to 255 characters, but Memo fields can handle 64,000 characters (about 8 pages of single-spaced text).
Nice workaround: have a look at this Stack Overflow answer.
The problem is that the ACE driver is inferring a TEXT data type for the column you're populating the data set from. Text columns are limited to 255 characters. You need to force it to use the MEMO data type.
Your best bet for that is to guarantee that the majority of the first eight rows in that column exceed 255 characters in length.
Source
This behavior is determined by the predictive nature of the Excel driver/provider. Since it doesn't know what the data types are, it has to make a guess based upon the data in the first several rows. If the contents of a field exceeds 255 characters, and it's in the first several rows, then the data type will be Memo, otherwise it will probably be Text (which will result in the truncation).
Excel has some limits.
Excel specifications and limits - 2013
As you can see in the link posted:
Feature | Maximum limit
Column width | 255 characters

What data type should I use to store text data?

I am making a profile system. I have a field named AboutMe in the database, and its datatype is text.
This field may contain a maximum of 30,000 characters. The problem is that if I use up to 27,000 characters (or anything more than 4,000), they are not shown in the UI; instead the content is truncated and only a few characters are shown.
If I use 4,000 or more characters, the UI shows fewer than 4,000 characters.
I am using SQL server 2008 R2 database.
As of SQL Server 2005, you should use VARCHAR(MAX) for non-Unicode text, or NVARCHAR(MAX) for Unicode text (using up 2 bytes per character). TEXT and NTEXT have been deprecated and should not be used anymore.
Those are the current datatypes, and they can be treated just like any other text / string column. All the string functions work on them just fine.
The maximum capacity for each of those columns is 2 GB of data - that's 2 billion non-Unicode characters or 1 billion Unicode characters.
Considering that a really long book like Tolstoy's War and Peace is probably 5 million characters or less (about 560,000 words), this would be enough space to store that book at least 200 times over in Unicode - which should be plenty for most applications.
The text data type is deprecated, so you should use varchar(max) instead.
If by UI you mean SQL Server Management Studio, it's correct that it won't show large text values; the editor has some limitations like that, for performance reasons.
When you access the data programmatically there is no such limitation. You should however be aware that large text values are sent in a separate data stream, so if you have more than one large text value per record, you have to access them in the order that you select them.
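As a quick illustration of that point (the connection string, table and column names are hypothetical), reading an NVARCHAR(MAX) column through ADO.NET returns the full value; there is no 4,000-character cap outside of the Management Studio grid:

using System;
using System.Data.SqlClient;

class ReadAboutMe
{
    static void Main()
    {
        using (var connection = new SqlConnection("Data Source=.;Initial Catalog=Profiles;Integrated Security=true"))
        using (var command = new SqlCommand("SELECT AboutMe FROM UserProfile WHERE UserId = @id", connection))
        {
            command.Parameters.AddWithValue("@id", 1);
            connection.Open();

            using (var reader = command.ExecuteReader())
            {
                if (reader.Read())
                {
                    // GetString returns the complete value stored in the
                    // NVARCHAR(MAX) column, however long it is.
                    string aboutMe = reader.GetString(0);
                    Console.WriteLine(aboutMe.Length);
                }
            }
        }
    }
}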
You can use the ntext datatype for that.

decrypt an encrypted value?

I have an old Paradox database (I can convert it to Access 2007) which contains more than 200,000 records. This database has two columns: the first one is named "Word" and the second one is named "Mean". It is a dictionary database and my client wants to convert this old database to ASP.NET and SQL.
However, we don't know what key or method is used to encrypt or encode the "Mean" column, which is in Unicode format. The software itself was written in Delphi 7 and we don't have the source code. My client only knows the credentials for logging in to the database. The problem is decoding the Mean column.
What I do have is the compiled windows application and the Paradox database. This software can decode the "Mean" column for each "Word" so the method and/or key is in its own compiled code(.exe) or one of the files in its directory.
For example, we know that in the following row "Zymurgy"
means exactly "مبحث عمل تخمیر در شیمی علمی, تخمیر شناسی", since the application translates it like that. Here is what the record looks like when I open the database in Access:
Word Mean
Zymurgy 5OBnGguKPdDAd7L2lnvd9Lnf1mdd2zDBQRxngsCuirK5h91sVmy0kpRcue/+ql9ORmP99Mn/QZ4=
Therefore we're trying to discover how the value in the Mean column is converted to "مبحث عمل تخمیر در شیمی علمی, تخمیر شناسی". I think the "Mean" value in the above row is encoded as a Base64 string, but decoding the Base64 string does not yet produce the expected text.
The extensions for files in the win app directory are dll, CCC, DAT, exe (other than the main app file), SYS, FAM, MB, PX, TV, VAL.
Any kind of help is appreciated.
Here are two more examples; remember that the double quotes at the start and end are not part of the strings:
word: "abdominal"
coded value: "vwtj0bmj7jdF9SS8sbrIalBoKMDvTbpraFgG4gP/G9GLx5iU/E98rQ=="
translation in Farsi: "شکمی, بطنی, وریدهای شکمی, ماهیان بطنی"
word: "cart"
coded value: "KHoCkDsIndb6OKjxVxsh+Ti+iA/ZqP9sz28e4/cQzMyLI+ToPbiLOaECWQ8XKXTz"
translation in Farsi: "ارابه, گاری, دوچرخه, چرخ, با گاری بردن"
Here is the result in different encodings (a decoding sketch follows the list):
1- in unicode the result is: "ᩧ訋퀽矀箖�柖�섰᱁艧껀늊螹泝汖銴岔꫾也捆￉鹁"
2- in utf32 the result is: "��������������"
3- in utf7 the result is: "äàg\v=ÐÀw²ö{Ýô¹ßÖg]Û0ÁAgÀ®²¹ÝlVl´\\¹ïþª_NFcýôÉÿA"
4- in utf8 the result is: "��g\v�=��w���{����g]�0�Ag��������lVl���\\����_NFc����A�"
5- in 1256 the result is: "نàg\vٹ=ذہw²ِ–{فô¹كضg]غ0ءAg‚ہ®ٹ²¹‡فlVl´’”\\¹ï‏ھ_NFc‎ôةےA"
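For reference, the attempts above amount to something like the following sketch (standard .NET encodings only); since none of them yields readable Farsi, the bytes are most likely encrypted rather than merely encoded:

using System;
using System.Text;

class DecodeAttempts
{
    static void Main()
    {
        string coded =
            "5OBnGguKPdDAd7L2lnvd9Lnf1mdd2zDBQRxngsCuirK5h91sVmy0kpRcue/+ql9ORmP99Mn/QZ4=";

        byte[] raw = Convert.FromBase64String(coded);

        // Try the obvious text encodings; if every result is still gibberish,
        // the data is almost certainly encrypted, not just encoded.
        Console.WriteLine(Encoding.Unicode.GetString(raw));           // UTF-16 LE
        Console.WriteLine(Encoding.UTF32.GetString(raw));             // UTF-32
        Console.WriteLine(Encoding.UTF8.GetString(raw));              // UTF-8
        Console.WriteLine(Encoding.GetEncoding(1256).GetString(raw)); // Windows-1256 (Arabic)
    }
}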
I have also discovered that the Paradox database system is very complex when it comes to key management; most of the time the keys are "compound keys", which is why it's problematic and why it was abandoned!
UPDATE: I'm trying to automate this with AutoIt v3, because as I understand it the decryption can't be worked out in a day or two. Now I have another problem, related to text/fonts. When I copy the translated text into Notepad it turns into unrecognizable text unless I change Notepad's font to the font used by the translation software. If I type something in Farsi in Notepad it shows correctly regardless of which font I've chosen. More interestingly, when I copy the text into any other program, like MS Office Word, it is shown correctly no matter what font I choose.
So how can I get around this?
In this situation, I would think about writing a script/program to simply pull all the data out through the existing program.
You could write an application to send keypresses to the app which would select and copy each value in turn.
It would take a while to run, but you could just leave it overnight (how big is your database?) and it only has to run once.
Not sure how easy this would be, since I haven't seen this app of course - might this work?
Take a debugger like OllyDbg or SoftICE. Find the place where the Mean value is decoded/encoded, then step through the instructions one by one and check all the registers to find out what is being done. I have done this numerous times. That should help you get started, since you have the application which is able to decode this stuff. You also have a reference word. That's all you need.
Also take into consideration: Unicode can be little- or big-endian, so you might try swapping the bytes. UTF-8 can be a pain, since some characters are stored as one byte and some as two or more bytes.
You can also try to take words which are almost identical in Farsi and try to compare the outputs. That could lead to a reconstruction of a custom code page, if there is one.

ESE column type to XmlSerialize arbitrary objects

What's the best ESE column type to XmlSerialize an object to my ESE DB?
Both "long binary" and "long ASCII text" work OK.
Reason for long binary: absolutely sure there's no character conversion.
Reason for long text: the XML is text.
It seems MSDN says the 2 types only differ when sorting and searching. Obviously I'm not going to create any indices over that column; fields that need to be searchable and/or sortable are stored in separate columns of appropriate types.
Is it safe to assume any UTF-8 text, less than 2 GB in size, can be saved to and loaded from the ESE "long ASCII text" column value?
Yes you can put up to 2GB of data of UTF8 text into any long text/binary column. The only difference between long binary and long text is the way that the data is normalized when creating an index over the column. Other than that ESE simply stores the provided bytes in the column with no conversion. ESE can only index ASCII or UTF16 data and it is the application's responsibility to make sure the data is in the correct format so it would seem to be more correct to put the data into a long binary column. As you aren't creating an index there won't actually be any difference.
If you are running on Windows 7 or Windows Server 2008 R2 you should investigate column compression. For XML data you might get significant savings simply by turning compression on.
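Agreed. Since the question is about XML-serialized objects, here is a minimal sketch (plain BCL only, with a made-up Profile type; the actual ESE write is left to whatever wrapper you use) of producing the UTF-8 byte array that would go into the long binary column:

using System.IO;
using System.Text;
using System.Xml.Serialization;

public class Profile
{
    public string Name { get; set; }
    public int Age { get; set; }
}

class SerializeSketch
{
    static byte[] ToXmlBytes<T>(T value)
    {
        var serializer = new XmlSerializer(typeof(T));
        using (var stream = new MemoryStream())
        using (var writer = new StreamWriter(stream, new UTF8Encoding(false)))
        {
            serializer.Serialize(writer, value);
            writer.Flush();
            // These bytes go into the long binary column as-is;
            // ESE stores them without any character conversion.
            return stream.ToArray();
        }
    }

    static void Main()
    {
        byte[] payload = ToXmlBytes(new Profile { Name = "example", Age = 42 });
        System.Console.WriteLine(payload.Length);
    }
}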
