What is wrong with my encoding, when reading characters from PDF?

I'm reading a PDF file with C#, but the text seems to come back in a different encoding: the characters I get are not the ones I see when I open the file in a PDF viewer.
I thought a UTF-8 encoding would be correct.
What am I doing wrong?
string file = @"c:\document.pdf";
Stream stream = File.Open(file, FileMode.Open);
BinaryReader binaryReady = new BinaryReader(stream);
byte[] buffer = binaryReady.ReadBytes(Convert.ToInt32(stream.Length));
var encoder = UTF8Encoding.UTF8.GetString(buffer);

A PDF is a complex, multi-part binary format; it is not just UTF-8 text.
If you want to read a PDF file yourself, you have to work through the full PDF file format documentation and implement the large and complex details of how the format works.
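
As a rough illustration of why decoding the whole file as UTF-8 cannot work, the sketch below treats the data as bytes. It assumes C# 9 or later (top-level statements); `LooksLikePdf` and the sample bytes are hypothetical, chosen to resemble the start of a PDF with a header line followed by a few high-bit bytes.

```csharp
using System;
using System.Linq;
using System.Text;

// True when the data starts with the ASCII signature "%PDF-".
static bool LooksLikePdf(byte[] data)
{
    byte[] signature = Encoding.ASCII.GetBytes("%PDF-");
    if (data.Length < signature.Length) return false;
    for (int i = 0; i < signature.Length; i++)
    {
        if (data[i] != signature[i]) return false;
    }
    return true;
}

// Hypothetical sample: "%PDF-1.6", a newline, then '%' and four high-bit bytes.
byte[] sample = { 0x25, 0x50, 0x44, 0x46, 0x2D, 0x31, 0x2E, 0x36, 0x0A,
                  0x25, 0xE2, 0xE3, 0xCF, 0xD3 };

// Decoding as UTF-8 replaces the invalid byte sequences with U+FFFD,
// so encoding the string again does not give the original bytes back.
string asUtf8 = Encoding.UTF8.GetString(sample);
byte[] roundTrip = Encoding.UTF8.GetBytes(asUtf8);

Console.WriteLine(LooksLikePdf(sample));
Console.WriteLine(asUtf8.Contains("\uFFFD"));
Console.WriteLine(roundTrip.SequenceEqual(sample));
```

The signature check succeeds, the decoded string contains replacement characters, and the round trip no longer matches the input. Extracting the visible text properly still requires a parser that understands the PDF structure.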

Related

c# How to undo Encoding.UTF8.GetBytes or convert to File.ReadAllBytes

A C# application was written to transfer files to an FTP server, and the code below was used to read a JPEG file. It is a bad approach because it corrupts the JPEG:
StreamReader sourceStream = new StreamReader("image.jpeg");
byte[] fileContents = Encoding.UTF8.GetBytes(sourceStream.ReadToEnd());
The code below would work for the file transfer:
fileContents = File.ReadAllBytes(sourceStream.ReadToEnd());
Now I have a library of corrupted JPEGs.
How can I fix the mess?

You shouldn't use StreamReader at all for reading binary files; it's a TextReader. Even your second piece of code is wrong, unless sourceStream contains only a file name.
It's likely that your data is corrupted beyond repair. You can attempt the inverse with Encoding.UTF8.GetString and StreamWriter, but byte sequences that are not valid UTF-8 were replaced during decoding, so the encoding step has most likely caused irreparable damage already.
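
To make the damage concrete, here is a minimal sketch that compares the two ways of reading a file. It assumes C# 9 or later (top-level statements); `ReadViaStreamReader` is a hypothetical helper that mirrors the code from the question, and the four bytes stand in for the start of a typical JPEG file.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

// The lossy approach from the question: decode as text, then re-encode.
static byte[] ReadViaStreamReader(string path)
{
    using (var reader = new StreamReader(path))
    {
        return Encoding.UTF8.GetBytes(reader.ReadToEnd());
    }
}

// Stand-in for the first bytes of a JPEG; none of them is valid UTF-8 here.
byte[] original = { 0xFF, 0xD8, 0xFF, 0xE0 };

string path = Path.GetTempFileName();
try
{
    File.WriteAllBytes(path, original);

    byte[] viaText = ReadViaStreamReader(path);   // goes through a string
    byte[] viaBytes = File.ReadAllBytes(path);    // takes the file path directly

    Console.WriteLine(viaText.SequenceEqual(original));
    Console.WriteLine(viaBytes.SequenceEqual(original));
}
finally
{
    File.Delete(path);
}
```

The first comparison prints False and the second prints True: once the bytes have passed through the text decoder, the original values are gone, which is why the already uploaded files are unlikely to be recoverable.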

Exporting a Microsoft report to PDF doesn't show Chinese characters

I've got the problem that I get no Chinese characters when exporting a Microsoft Report to a PDF file.
byte[] mybytes = report.Render("pdf");
using (FileStream fs = File.Create(@"D:\output.pdf"))
{
    fs.Write(mybytes, 0, mybytes.Length);
}
If I export the same report to a Word file it works fine.
byte[] myWordbytes = report.Render("word");
using (FileStream fs = File.Create(@"D:\output.doc"))
{
    fs.Write(myWordbytes, 0, myWordbytes.Length);
}
When converting that Word file to PDF, I also get the Chinese characters in the converted PDF file.
I don't want to do this workaround. How can I solve this?
The required fonts seem to be embedded into the PDF.

Opening a PDF file as a raw text document

I have a PDF document that gets encrypted. During encryption, a code is embedded at the end of the file stream before it is written to a file.
This PDF is later decrypted, and the details are viewable in any PDF viewer.
The issue is the embedded code is also then visible in the decrypted PDF and it needs removing.
I'm looking to decrypt the PDF document, remove the embedded document code then save it to a filename.
//Reading the PDF
Encoding enc = Encoding.GetEncoding("us-ascii");
while ((read = cs.Read(buffer, 0, buffer.Length)) > 0)
{
    System.Text.Encoding.UTF8.GetString(buffer);
    x = x + enc.GetString(buffer);
}
//Remove the code
x = x.Replace("CODE", "");
//Write file
byte[] bytes = enc.GetBytes(x);
File.WriteAllBytes(@filePath, bytes);
When the original file is generated it appears to be using a different encoder because the first line on the original file reads %PDF-1.6%âãÏÓ and on the decoded file %PDF-1.6 %????.
I have tried ascii, us-ascii, UTF8 and Unicode, but after removing the embedded CODE the file stopped opening because it was corrupted. Note that the embedded code sits in the raw file after the PDF %%EOF tag.
Has anyone any ideas?
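
One way to sidestep the encoding question is to never turn the bytes into a string at all. The sketch below is only an illustration under that assumption: `LastIndexOf` and `TrimAfterLastEof` are hypothetical helpers, and the code assumes that everything after the last %%EOF marker is the embedded code and can be dropped. It uses C# 9 top-level statements.

```csharp
using System;
using System.Text;

// Returns the index of the last occurrence of pattern in data, or -1.
static int LastIndexOf(byte[] data, byte[] pattern)
{
    for (int i = data.Length - pattern.Length; i >= 0; i--)
    {
        bool match = true;
        for (int j = 0; j < pattern.Length; j++)
        {
            if (data[i + j] != pattern[j]) { match = false; break; }
        }
        if (match) return i;
    }
    return -1;
}

// Keeps everything up to and including the last "%%EOF"; drops the rest.
// Returns the input unchanged when no marker is found.
static byte[] TrimAfterLastEof(byte[] data)
{
    byte[] marker = Encoding.ASCII.GetBytes("%%EOF");
    int index = LastIndexOf(data, marker);
    if (index < 0) return data;

    int end = index + marker.Length;
    byte[] trimmed = new byte[end];
    Array.Copy(data, trimmed, end);
    return trimmed;
}

// Tiny stand-in for a decrypted file with a trailing code.
byte[] input = Encoding.ASCII.GetBytes("%PDF-1.6\n%%EOF\nCODE");
byte[] cleaned = TrimAfterLastEof(input);
Console.WriteLine(cleaned.Length);
```

With real data, the decrypted bytes would be collected first (for example in a MemoryStream) and the result written with File.WriteAllBytes. Because no decoding takes place, the high-bit bytes on the second line of the PDF pass through unchanged.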

Problems with strings in the CSV file

I have an application that reads information from a CSV file and writes it to the database. But some characters (for example, º and ç) cause problems when they are written to the database. Does anyone know how to fix this?
Thank you.
I'm using these lines of code to read the information from the CSV file:
string directory = @"C:\test.csv";
StreamReader stream = new StreamReader(directory);
string line = "";
line = stream.ReadLine();
string[] column = line.Split(';');

StreamReader defaults to UTF-8, and your file is probably in a different encoding. Try specifying it like this:
var encoding = Encoding.Unicode; // UTF-16 (little-endian); .NET has no Encoding.UTF16
StreamReader stream = new StreamReader(directory, encoding);
Note that you need to know which encoding the file uses in order to read it properly. I'm only guessing that it might be UTF-16; I can't know what it actually is.

You should specify the right encoding when reading the file. The default is UTF-8. Your file is probably encoded with a different encoding.

This is most likely related to the encoding that is used when reading the file. By default, UTF-8 is assumed. To read the file correctly, you need to specify the right encoding, e.g.:
string directory = @"C:\test.csv";
using (StreamReader stream = new StreamReader(directory, Encoding.ASCII))
{
    string line = "";
    line = stream.ReadLine();
    string[] column = line.Split(';');
}
You can try the following encodings:
Encoding.Default for the ANSI encoding based on the current Windows code page (on .NET Framework; on .NET Core and later, Encoding.Default is UTF-8).
Encoding.ASCII for ASCII encoding (7-bit only, so characters such as º and ç cannot be represented).
Encoding.UTF* for the different Unicode encodings.
Please note that I enclosed the StreamReader in a using block so that it is disposed when it is no longer needed.
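
The effect of picking the wrong encoding can be shown with three bytes. This is a sketch under an assumption: it supposes the CSV was saved in a single-byte Western European encoding such as ISO-8859-1, which the question does not confirm. `ReadFirstLineColumns` is a hypothetical helper, and the code uses C# 9 top-level statements.

```csharp
using System;
using System.IO;
using System.Text;

// Reads the first line with an explicit encoding and splits it on ';'.
// An empty file yields a single empty column.
static string[] ReadFirstLineColumns(string path, Encoding encoding)
{
    using (var reader = new StreamReader(path, encoding))
    {
        string line = reader.ReadLine() ?? string.Empty;
        return line.Split(';');
    }
}

// "º ç" as stored by ISO-8859-1 (assumed for illustration).
byte[] raw = { 0xBA, 0x20, 0xE7 };
Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

string wrong = Encoding.UTF8.GetString(raw);  // invalid UTF-8, so U+FFFD appears
string right = latin1.GetString(raw);         // decodes to the intended characters

Console.WriteLine(wrong.Contains("\uFFFD"));
Console.WriteLine(right == "\u00BA \u00E7");
```

If the file really is in that encoding, passing the same Encoding object to the StreamReader (as `ReadFirstLineColumns` does) returns the columns with º and ç intact. If the characters still look wrong, the file is in some other encoding and a different one has to be tried.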

C# File Encoding Type changed?

I am creating a file with ASCII encoding, but when I check the encoding of that file, it is reported as UTF8Encoding.
Can anyone explain why, or point out my mistake?
CODE:
Creating File:
FileStream _textStream = File.Open("CreateAsciiFile.txt", FileMode.Create, FileAccess.Write);
StreamWriter _streamWriter = new StreamWriter(_textStream, System.Text.Encoding.ASCII);
Byte[] byteContent = BtyeTowrite(); // This returns the array of byte
foreach (var myByte in byteContent)
    _streamWriter.Write(System.Convert.ToChar(myByte));
Reading a file:
StreamReader sr = new StreamReader(@"C:\CreateAsciiFile.txt", true);
string LineText = sr.ReadLine();
System.Text.Encoding enc = sr.CurrentEncoding;
Here enc is UTF8Encoding, but I am expecting ASCII.

You need to read from the reader before querying the encoding, so read something before calling sr.CurrentEncoding. The StreamReader looks at the first bytes to try to guess the encoding, and because ASCII has no BOM, it may not be recognizable as such and you may get wrong results. For example, at the binary level there is no difference between an ASCII-encoded file and an ISO-8859-1-encoded file that contains only ASCII characters.

The answer is probably this:
Every valid ASCII character is also a valid UTF-8 encoded Unicode character with the same binary value.
In other words, your ASCII file is valid as both UTF-8 and ASCII, and it is detected as UTF-8.
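
Both points can be checked with a short sketch that uses an in-memory stream instead of a file. `DetectedEncodingName` is a hypothetical helper, and the code assumes C# 9 top-level statements.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

// Reads one line with BOM detection enabled, then reports the encoding
// the StreamReader ended up using.
static string DetectedEncodingName(byte[] data)
{
    using (var reader = new StreamReader(new MemoryStream(data), true))
    {
        reader.ReadLine();
        return reader.CurrentEncoding.WebName;
    }
}

string text = "plain ASCII text";
byte[] asciiBytes = Encoding.ASCII.GetBytes(text);
byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);

// For pure ASCII text, both encodings produce identical bytes.
Console.WriteLine(asciiBytes.SequenceEqual(utf8Bytes));

// With no BOM to go by, the reader stays on its UTF-8 default.
Console.WriteLine(DetectedEncodingName(asciiBytes));
```

The first line prints True and the second prints utf-8. The reader has no way to tell the file was written with Encoding.ASCII, so if the encoding matters it has to be passed to the StreamReader explicitly.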
