Japanese culture in a C# application on an English OS - c#

I have a C# application targeting .NET 3.5. The application saves a file and zips it using the vjlib library, and unzips it again when the file is opened. However, when I save a file with a Japanese name on a machine running an English OS, the application cannot read the Japanese characters back when opening it. Is this due to a missing Windows language pack or something similar?

This problem is most likely caused by the application that created the .zip file. File names are stored as 8-bit characters inside the archive. The ZIP specification says that a name should be encoded either in code page 437 or in UTF-8. Code page 437 is the original IBM PC character set, an encoding that doesn't support any Japanese characters. It is not unusual for an application to simply use its own 8-bit encoding, typically determined by the default system code page.
The library you use is the .NET runtime support library for Visual J#. It isn't clear whether it supports specifying a different encoding; documentation is hard to find these days since it has been deprecated for so long. Consider, say, DotNetZip. Its ZipFile class has an AlternateEncoding property you can initialize from Encoding.GetEncoding(). You still need to find out which encoding was used; knowing where the file came from is important help in making the right guess. The common code page for Japanese Windows is 932 (Shift-JIS).
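A minimal sketch of the reading side with DotNetZip, assuming the archive was produced on a Japanese Windows system (the file paths here are made up):

using System;
using System.Text;
using Ionic.Zip; // DotNetZip

class ReadJapaneseZip
{
    static void Main()
    {
        // Assumption: entry names were written with code page 932 (Shift-JIS).
        var options = new ReadOptions { Encoding = Encoding.GetEncoding(932) };
        using (ZipFile zip = ZipFile.Read(@"C:\data\archive.zip", options))
        {
            foreach (ZipEntry entry in zip)
            {
                Console.WriteLine(entry.FileName); // Japanese names now decode correctly
                entry.Extract(@"C:\data\extracted");
            }
        }
    }
}

When writing a new archive, setting AlternateEncoding to UTF-8 (and AlternateEncodingUsage accordingly) sidesteps the problem for any reader that honors the ZIP spec's UTF-8 flag.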

Related

Determine File Encoding for Large Files in SSIS / C#

I have very large text files that are being imported via SSIS into our database. These files come from hundreds of companies and a variety of source systems. Most of these files import fine with code page 1252, but in some files, buried somewhere in one of the rows, there may be some oddball characters that don't fit in the 1252 code page.
I've implemented a solution based on this SO answer, which lets me proceed with code page 1252 on one path if the file's encoding is ANSI/ASCII, or go down another path with code page 65001. This seems to work in a lot of cases, but is not reliable enough to be something we could use in production.
using (var r = new StreamReader(filename, Encoding.Default))
{
    richtextBox1.Text = r.ReadToEnd();
    // CurrentEncoding only changes if a BOM was detected during reading;
    // with no BOM it simply stays Encoding.Default.
    var encoding = r.CurrentEncoding;
}
I'm far from an expert on file encodings, but I'm guessing that this happens because it only reads a certain portion of the file, and if everything looks like ANSI characters, it assumes ANSI (these files are almost guaranteed not to have a BOM)?
Would I have to read the entire file into memory and examine every character to arrive at a mostly accurate encoding? How can I do that when reading an extremely large file into memory would cause huge problems?
Is there a way to accomplish this with a reasonable level of certainty? I don't need to account for foreign languages, as these files are all English, but we've encountered the occasional strange character in them. I'm thinking we need to allow for ASCII, UTF-8, and UTF-16.
Is there a way to just determine whether to use code page 1252 or 65001 in SSIS?
On a related note: if ASCII is a subset of UTF-8, why is it that when I import ALL the files as code page 65001, some of the characters don't translate correctly? Shouldn't UTF-8 work for everything if it encompasses ASCII?
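One low-memory approach, sketched below: stream the file in fixed-size chunks through a strict UTF-8 decoder and fall back to code page 1252 if it ever throws. The helper name and buffer size are illustrative, not from any library.

using System;
using System.IO;
using System.Text;

static class EncodingProbe
{
    // True if the whole file is valid UTF-8; reads in 64 KB chunks so
    // arbitrarily large files never need to fit in memory.
    public static bool IsValidUtf8(string path)
    {
        // Strict decoder: throws on any invalid byte sequence.
        Decoder decoder = new UTF8Encoding(false, true).GetDecoder();
        byte[] bytes = new byte[64 * 1024];
        char[] chars = new char[bytes.Length + 4]; // room for bytes carried over between chunks

        using (FileStream stream = File.OpenRead(path))
        {
            try
            {
                int read;
                while ((read = stream.Read(bytes, 0, bytes.Length)) > 0)
                    decoder.GetChars(bytes, 0, read, chars, 0, false);
                decoder.GetChars(bytes, 0, 0, chars, 0, true); // rejects a truncated trailing sequence
            }
            catch (DecoderFallbackException)
            {
                return false;
            }
        }
        return true;
    }
}

Pick 65001 when IsValidUtf8 returns true and 1252 otherwise. This also hints at the answer to the last question: ASCII is a subset of UTF-8, but Windows-1252 is not; 1252 bytes above 0x7F form invalid UTF-8 sequences, which is why importing everything as 65001 mangles them.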

How to get an RDLC report rendered to PDF with a ToUnicode entry, so non-ANSI text can be copy-pasted from the resulting PDF

Preface: we have reports generated in a C# application using the Microsoft.Reporting.WebForms.LocalReport class from an RDLC file. They are rendered in PDF format. The text in the reports is mostly Cyrillic. The problem is that it's impossible to copy it from the resulting PDF file; you get garbage.
The reason you get garbage is that the text is written with the "Identity-H" encoding for the font. That is not a real encoding; it's just an assignment of CIDs (basically, numbers) to the glyphs used in the PDF file. Adobe's PDF format has the "ToUnicode" entry for exactly this reason: it's what stores the correspondence of CIDs to Unicode characters. If this information were present, it would be possible to copy/paste text from the file correctly.
Obviously, this class doesn't write it. While researching the problem, I came across this page that acknowledges the lack of copy/paste support and praises it finally being implemented... in SQL Server 2016 Reporting Services.
Well, we don't use the ServerReport class or SQL Server RS, or SQL Server 2016. It would be a weird and far too large architectural change to move to them just because managers complain they cannot copy text from PDFs.
So, is there a workaround? I doubt no one has faced this problem before. Maybe writing the ToUnicode entry was implemented in LocalReport in a newer version of .NET? Has someone written some sort of wrapper classes that take a byte array of the PDF and enhance it? Or do people render the report to DOCX and then use some other library to make a correct PDF out of that?
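As a first diagnostic step, a sketch using iTextSharp (an assumption; it is not part of LocalReport) to see which fonts in a generated file actually lack the /ToUnicode entry:

using System;
using iTextSharp.text.pdf; // iTextSharp 5.x

class ToUnicodeCheck
{
    static void Main()
    {
        var reader = new PdfReader("report.pdf"); // illustrative file name
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            PdfDictionary page = reader.GetPageN(i);
            PdfDictionary resources = page.GetAsDict(PdfName.RESOURCES);
            PdfDictionary fonts = resources == null ? null : resources.GetAsDict(PdfName.FONT);
            if (fonts == null) continue;

            foreach (PdfName key in fonts.Keys)
            {
                PdfDictionary font = fonts.GetAsDict(key);
                bool hasToUnicode = font.Contains(PdfName.TOUNICODE);
                Console.WriteLine("Page {0}, font {1}: ToUnicode = {2}", i, key, hasToUnicode);
            }
        }
        reader.Close();
    }
}

Any font reported without ToUnicode is one whose text will copy out as garbage; a post-processing wrapper would have to synthesize and attach that CMap for exactly those fonts.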

Opening a Unix file in Windows Notepad++?

I receive a file from a supplier that I download via SFTP. Our systems all run on Windows.
When I open the file in Notepad++, the status bar says "UNIX" and "UTF-8".
The special characters aren't displayed correctly.
I tried converting the file to the different formats Notepad++ offers, but none of them converted the character 'OSC' to the German letter 'ä'. Is this a known Unix/Windows thing? My google-fu obviously isn't good enough.
Which kind of conversion should I try to display the file correctly?
How can I achieve the same programmatically in C#?
It is common on Windows for a file's encoding not to match what the editor, or even the file's own XML header, says it is. People are sloppy. Maybe it's really UTF-16, or the non-standard Windows "extended ASCII" thing, which is probably CP-1252. (This is less common on *nix, where we usually just use UTF-8 and have no need for anything else... not that *nix users are much less sloppy.)
To figure out which encoding it is, I would make a copy of the file, delete the parts that are not a problem (leaving "Mägenwil" as the entire file), save it, and run the Linux command "file", which will tell you what the encoding is (reliable only for small files, since it doesn't read the whole file; Notepad++ may well do the same thing). The reason for deleting the other parts is that the file might be a mix of UTF-8, which the editor used for its detection, plus something else.
To test, I would try the iconv command on Linux. For example:
iconv -f UTF-16 -t UTF-8 -o outfile infile
Any encoding conversion should be possible in C#, or any full-featured language, as long as you know how the text was mangled so you can reverse it. And if you find that the file is part UTF-8 and part something else, remember not to convert the whole file, only the affected parts.
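For completeness, the same conversion as the iconv example above, sketched in C#; the encodings and file names are assumptions to be replaced with whatever the investigation turns up:

using System.IO;
using System.Text;

class Recode
{
    static void Main()
    {
        // Assumption: the file is really CP-1252; decode it as such...
        Encoding source = Encoding.GetEncoding(1252);
        string text = File.ReadAllText("infile.txt", source);
        // ...and re-save it as UTF-8 (without a BOM).
        File.WriteAllText("outfile.txt", text, new UTF8Encoding(false));
    }
}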

ANSI vs SHIFT JIS vs UTF-8 in c#

I have been trying to figure out the difference for quite some time now. The issue is with a file in ANSI encoding that contains Japanese characters which display as: ­‚È‚­‚Æ‚à1‚‚ÌINCREMENTs‚ª•K—v‚Å‚·. The equivalent in Shift-JIS is 少なくとも1つのINCREMENT行が必要です, which is the expected Japanese.
I need to display these characters on a webpage after reading them from the file (in ANSI). There are some other files in UTF-8 that display their characters correctly and don't show this problem. I am finding it difficult to figure out what the difference is and how to choose the right encoding here.
I use C# to read and display this file, and I also need to write the string back into the file if it is modified on the web. What encoding and decoding schemes apply here?
As far as code pages are concerned, "ANSI" (and Encoding.Default in .NET) basically just means "the non-Unicode code page used by this system". Exactly which code page that is depends on how the system is configured, but on a Western European system it's likely to be Windows-1252.
For the system where that text comes from, "ANSI" would appear to mean Shift-JIS, so unless your system uses the same code page, you'll need to tell your code to read the text as Shift-JIS.
Assuming you're reading the file with a StreamReader, there are various constructors that take an Encoding, so just grab a Shift-JIS encoding with Encoding.GetEncoding("shift_jis") or Encoding.GetEncoding(932) and use it to construct your StreamReader.
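A minimal sketch of both directions, with an illustrative file name:

using System.IO;
using System.Text;

class ShiftJisRoundTrip
{
    static void Main()
    {
        Encoding shiftJis = Encoding.GetEncoding("shift_jis"); // or Encoding.GetEncoding(932)

        // Read: decode Shift-JIS bytes into a normal Unicode string.
        string text;
        using (var reader = new StreamReader("japanese.txt", shiftJis))
        {
            text = reader.ReadToEnd();
        }

        // ... display and edit on the web as an ordinary .NET string ...

        // Write: re-encode back to Shift-JIS on the way out.
        using (var writer = new StreamWriter("japanese.txt", false, shiftJis))
        {
            writer.Write(text);
        }
    }
}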

Silverlight displaying Asian languages?

My application requires support for 2 different Asian languages: Chinese and Tamil.
It should be able to cater for changes without recompiling, something like Java's resource bundles.
In this case, if I put the Unicode text in an external file and get Silverlight to read it as a string, will Silverlight be able to parse it correctly?
Or I could use the Chinese/Tamil characters directly in the external file, but I'm not sure how to retrieve those characters in code.
Either way, these languages will be shown on the same screen, so I don't think localization will help.
Just place the content in XML (probably a XAML resource dictionary) using the default UTF-8 encoding.
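A sketch of the reading side, with illustrative file and element names: ship a UTF-8 strings.xml alongside the application and load it at runtime, so translators can edit it without a recompile.

using System;
using System.IO;
using System.Windows;
using System.Xml.Linq;

// strings.xml (saved as UTF-8):
// <strings>
//   <entry key="greeting">你好</entry>
//   <entry key="farewell">வணக்கம்</entry>
// </strings>

public static class ExternalStrings
{
    public static void Load()
    {
        var info = Application.GetResourceStream(new Uri("strings.xml", UriKind.Relative));
        using (var reader = new StreamReader(info.Stream)) // StreamReader defaults to UTF-8
        {
            XDocument doc = XDocument.Load(reader);
            foreach (XElement entry in doc.Root.Elements("entry"))
            {
                string key = (string)entry.Attribute("key");
                string text = entry.Value; // Chinese or Tamil arrives as a normal .NET string
                // e.g. assign to a TextBlock: myTextBlock.Text = text;
            }
        }
    }
}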
