Zip file with utf-8 file names - c#

In my website i have option to download all images uploaded by users. The problem is in images with hebrew names (i need original name of file). I tried to decode file names but this is not helping. Here is a code :
using ICSharpCode.SharpZipLib.Zip;
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(file.Name);
byte[] isoBytes = Encoding.Convert(utf8, iso, utfBytes);
string name = iso.GetString(isoBytes);
var entry = new ZipEntry(name + ".jpg");
zipStream.PutNextEntry(entry);
using (var reader = new System.IO.FileStream(file.Name, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
byte[] buffer = new byte[ChunkSize];
int bytesRead;
while ((bytesRead = reader.Read(buffer, 0, buffer.Length)) > 0)
{
byte[] actual = new byte[bytesRead];
Buffer.BlockCopy(buffer, 0, actual, 0, bytesRead);
zipStream.Write(actual, 0, actual.Length);
}
}
After utf-8 encoding i get hebrew file names like this : ??????.jpg
Where is my mistake?

Unicode (UTF-8 is one of the binary encoding) can represent more characters than the other 8-bit encoding. Moreover, you are not doing a proper conversion but a re-interpretation, which means that you get garbage for your filenames. You should really read the article from Joel on Unicode.
...
Now that you've read the article, you should know that in C# string can store unicode data, so you probably don't need to do any conversion of file.Name and can pass this directly to ZipEntry constructor if the library does not contains encoding handling bugs (this is always possible).

Try using
ZipStrings.UseUnicode = true;
It should be a part of the ICSharpCode.SharpZipLib.Zip namespace.
After that you can use something like
var newZipEntry = new ZipEntry($"My ünicödë string.pdf");
and add the entry as normal to the stream. You shouldn't need to do any conversion of the string before that in C#.

You are doing wrong conversion, since strings in C# are already unicode.
What tools do you use to check file names in archive?
By default Windows ZIP implementations use system DOS encoding for file names, while other implementations can use other encoding.

Related

Converting a byte[] string back to byte[] array

I have one scenario with class like this.
Class Document
{
public string Name {get;set;}
public byte[] Contents {get;set;}
}
Now I am trying to implement the import export functionality where I keep the document in binary so the document will be in json file with other fields and the document will be something in this format.
UEsDBBQABgAIAAAAIQCitGbRsgEAALEHAAATAAgCW0NvbnRlbnRfVHlwZXNdLnhtbCCiBAIooAACAAAAAAA==
Now when I upload this file back, I get this file as a string and I get the same data but when I try to convert this in binary bytes[] the file become corrupt.
How can I achieve this ?
I use something like this to convert
var ss = sr.ReadToEnd();
MemoryStream stream = new MemoryStream();
StreamWriter writer = new StreamWriter(stream);
writer.Write(ss);
writer.Flush();
stream.Position = 0;
var bytes = default(byte[]);
bytes = stream.ToArray();
This looks like base 64. Use:
System.Convert.ToBase64String(b)
https://msdn.microsoft.com/en-us/library/dhx0d524%28v=vs.110%29.aspx
And
System.Convert.FromBase64String(s)
https://msdn.microsoft.com/en-us/library/system.convert.frombase64string%28v=vs.110%29.aspx
You need to de-code it from base64, like this:
Assuming you've read the file into ss as a string.
var bytes = Convert.FromBase64String(ss);
There are several things going on here. You need to know the encoding for the default StreamWriter, if it is not specified it defaults to UTF-8 encoding. However, .NET strings are always either UNICODE or UTF-16.
MemoryStream from string - confusion about Encoding to use
I would suggest using System.Convert.ToBase64String(someByteArray) and its counterpart System.Convert.FromBase64String(someString) to handle this for you.

Why Empty Text File Contains 3 bytes?

I'm using a text file inside my C# project in vs2010. I added to solution and set its "Copy Output" to "Copy Always". When I use the following codes, it gives me the text result with leading three bytes or in utf8 one byte. I looked at windows explorers file properties, its size appears 3 bytes.
public static string ReadFile(string fileName)
{
FileStream fs = null;
try
{
fs = new FileStream(fileName, FileMode.Open);
FileInfo fi = new FileInfo(fileName);
byte[] data = new byte[fi.Length];
fs.Read(data, 0, data.Length);
fs.Close();
fs.Dispose();
string text = Encoding.ASCII.GetString(data);
return text;
}
catch (Exception)
{
if(fs != null)
{
fs.Close();
fs.Dispose();
}
return string.Empty;
}
}
Why is this like above? How can I read text files without StreamReader class?
Any helps, codes wil be very appreciated.
So, those three bytes you are seeing are the byte order marker for the unicode file I am guessing. For UTF-8, it is three bytes.
You can avoid those by saving the file using UTF-8 without signature.

binary file to string

i'm trying to read a binary file (for example an executable) into a string, then write it back
FileStream fs = new FileStream("C:\\tvin.exe", FileMode.Open);
BinaryReader br = new BinaryReader(fs);
byte[] bin = br.ReadBytes(Convert.ToInt32(fs.Length));
System.Text.Encoding enc = System.Text.Encoding.ASCII;
string myString = enc.GetString(bin);
fs.Close();
br.Close();
System.Text.ASCIIEncoding encoding = new System.Text.ASCIIEncoding();
byte[] rebin = encoding.GetBytes(myString);
FileStream fs2 = new FileStream("C:\\tvout.exe", FileMode.Create);
BinaryWriter bw = new BinaryWriter(fs2);
bw.Write(rebin);
fs2.Close();
bw.Close();
this does not work (the result has exactly the same size in bytes but can't run)
if i do bw.Write(bin) the result is ok, but i must save it to a string
When you decode the bytes into a string, and re-encodes them back into bytes, you're losing information. ASCII in particular is a very bad choice for this since ASCII will throw out a lot of information on the way, but you risk losing information when encoding and decoding regardless of the type of Encoding you pick, so you're not on the right path.
What you need is one of the BaseXX routines, that encodes binary data to printable characters, typically for storage or transmission over a medium that only allows text (email and usenet comes to mind.)
Ascii85 is one such algorithm, and the page contains links to different implementations. It has a ratio of 4:5 meaning that 4 bytes will be encoded as 5 characters (a 25% increase in size.)
If nothing else, there's already a Base64 encoding routine built into .NET. It has a ratio of 3:4 (a 33% increase in size), here:
Convert.ToBase64String Method
Convert.FromBase64String Method
Here's what your code can look like with these methods:
string myString;
using (FileStream fs = new FileStream("C:\\tvin.exe", FileMode.Open))
using (BinaryReader br = new BinaryReader(fs))
{
byte[] bin = br.ReadBytes(Convert.ToInt32(fs.Length));
myString = Convert.ToBase64String(bin);
}
byte[] rebin = Convert.FromBase64String(myString);
using (FileStream fs2 = new FileStream("C:\\tvout.exe", FileMode.Create))
using (BinaryWriter bw = new BinaryWriter(fs2))
bw.Write(rebin);
I don't think you can represent all bytes with ASCII in that way. Base64 is an alternative, but with a ratio between bytes and text of 3:4.

Modifying XMP data with C#

I'm using C# in ASP.NET version 2. I'm trying to open an image file, read (and change) the XMP header, and close it back up again. I can't upgrade ASP, so WIC is out, and I just can't figure out how to get this working.
Here's what I have so far:
Bitmap bmp = new Bitmap(Server.MapPath(imageFile));
MemoryStream ms = new MemoryStream();
StreamReader sr = new StreamReader(Server.MapPath(imageFile));
*[stuff with find and replace here]*
byte[] data = ToByteArray(sr.ReadToEnd());
ms = new MemoryStream(data);
originalImage = System.Drawing.Image.FromStream(ms);
Any suggestions?
How about this kinda thing?
byte[] data = File.ReadAllBytes(path);
... find & replace bit here ...
File.WriteAllBytes(path, data);
Also, i really recommend against using System.Bitmap in an asp.net process, as it leaks memory and will crash/randomly fail every now and again (even MS admit this)
Here's the bit from MS about why System.Drawing.Bitmap isn't stable:
http://msdn.microsoft.com/en-us/library/system.drawing.aspx
"Caution:
Classes within the System.Drawing namespace are not supported for use within a Windows or ASP.NET service. Attempting to use these classes from within one of these application types may produce unexpected problems, such as diminished service performance and run-time exceptions."
Part 1 of the XMP spec 2012, page 10 specifically talks about how to edit a file in place without needing to understand the surrounding format (although they do suggest this as a last resort). The embedded XMP packet looks like this:
<?xpacket begin="■" id="W5M0MpCehiHzreSzNTczkc9d"?>
... the serialized XMP as described above: ...
<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf= ...>
...
</rdf:RDF>
</x:xmpmeta>
... XML whitespace as padding ...
<?xpacket end="w"?>
In this example, ‘■’ represents the
Unicode “zero width non-breaking space
character” (U+FEFF) used as a
byte-order marker.
The (XMP Spec 2010, Part 3, Page 12) also gives specific byte patterns (UTF-8, UTF16, big/little endian) to look for when scanning the bytes. This would complement Chris' answer about reading the file in as a giant byte stream.
You can use the following functions to read/write the binary data:
public byte[] GetBinaryData(string path, int bufferSize)
{
MemoryStream ms = new MemoryStream();
using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read))
{
int bytesRead;
byte[] buffer = new byte[bufferSize];
while((bytesRead = fs.Read(buffer,0,bufferSize))>0)
{
ms.Write(buffer,0,bytesRead);
}
}
return(ms.ToArray());
}
public void SaveBinaryData(string path, byte[] data, int bufferSize)
{
using (FileStream fs = File.Open(path, FileMode.Create, FileAccess.Write))
{
int totalBytesSaved = 0;
while (totalBytesSaved<data.Length)
{
int remainingBytes = Math.Min(bufferSize, data.Length - totalBytesSaved);
fs.Write(data, totalBytesSaved, remainingBytes);
totalBytesSaved += remainingBytes;
}
}
}
However, loading entire images to memory would use quite a bit of RAM. I don't know much about XMP headers, but if possible you should:
Load only the headers in memory
Manipulate the headers in memory
Write the headers to a new file
Copy the remaining data from the original file

Why does text from Assembly.GetManifestResourceStream() start with three junk characters?

I have a SQL file added to my VS.NET 2008 project as an embedded resource. Whenever I use the following code to read the file's content, the string returned always starts with three junk characters and then the text I expect. I assume this has something to do with the Encoding.Default I am using, but that is just a guess. Why does this text keep showing up? Should I just trim off the first three characters or is there a more informed approach?
public string GetUpdateRestoreSchemaScript()
{
var type = GetType();
var a = Assembly.GetAssembly(type);
var script = "UpdateRestoreSchema.sql";
var resourceName = String.Concat(type.Namespace, ".", script);
using(Stream stream = a.GetManifestResourceStream(resourceName))
{
byte[] buffer = new byte[stream.Length];
stream.Read(buffer, 0, buffer.Length);
// UPDATE: Should be Encoding.UTF8
return Encoding.Default.GetString(buffer);
}
}
Update:
I now know that my code works as expected if I simply change the last line to return a UTF-8 encoded string. It will always be true for this embedded file, but will it always be true? Is there a way to test any buffer to determine its encoding?
Probably the file is in utf-8 encoding and Encoding.Default is ASCII. Why don't you use a specific encoding?
Edit to answer a comment:
In order to guess the file encoding you could look for BOM at the start of the stream. If it exists, it helps, if not then you can only guess or ask user.
if you try to load xml from assembly you actually need to inspect and skip the byte order mark bytes (drove me nuts):
....
byte[] data;
using (var stream = assembly.GetManifestResourceStream(filename))
{
var length = stream.Length;
data = new byte[length];
stream.Read(data, 0, (int) length);
}
if (!HasUtf8ByteOrderMark(data))
{
throw new InvalidOperationException("Expected UTF8 byte order mark EF BB BF");
}
return Encoding.UTF8.GetChars(data.Skip(3).ToArray());
And
static bool HasUtf8ByteOrderMark(byte[] data)
{
var bom = new byte[] { 0xEF, 0xBB, 0xBF };
return data[0] == bom[0] && data[1] == bom[1] && data[2] == bom[2];
}
More information here
I had the same problem in net.core
You can let streamreader do the encoding
using (var stream = = a.GetManifestResourceStream(resourceName))
using (var reader = new StreamReader(stream))
return reader.ReadToEnd();

Categories