I want to read a pdf from the request body and write it to a file.
Basically the raw request body looks something like this:
%PDF-1.7
4 0 obj
(Identity)
endobj
5 0 obj
(Adobe)
endobj
8 0 obj
<<
/Filter /FlateDecode
/Length 33775
/Length1 81256
/Type /Stream
>>
stream
...
%%EOF
I won't post the whole request body as it is quite big.
In my code I read it as follows:
var content = new StreamContent(Request.Body); // wrap the raw body stream
var bytes = await content.ReadAsByteArrayAsync(); // buffer the entire body as a byte array
Then I write the bytes to a file:
using (FileStream file = new FileStream(fileName, FileMode.Create, System.IO.FileAccess.Write))
{
    file.Write(bytes, 0, bytes.Length);
}
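For what it's worth, a shorter equivalent (a sketch, assuming ASP.NET Core, where Request.Body is a plain Stream) copies the body straight to disk without buffering the whole PDF in memory:
using (var file = new FileStream(fileName, FileMode.Create, FileAccess.Write))
{
    // Stream the raw request body directly into the file.
    await Request.Body.CopyToAsync(file);
}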
However, the text in that PDF is not getting written to the file; it's just a blank PDF. The PDF is supposed to show the text "test".
EDIT:
New findings.
When testing my code through Postman, I found that uploading a PDF via the binary option in the request body works just fine, but sending the same PDF's raw content through the raw body option does not. Is it because some information is lost when copying the raw content of the PDF directly into Postman?
Related
I'll get straight to the point: how do I upload PDF files from a C# backend to an HTTP web service inside a multipart/form-data request without the contents being mangled to the point of the file becoming unreadable? The web service documentation only states that text files should be text/plain and image files should be binary; PDF files are only mentioned as "also supported", with no mention of what format or encoding they should be in.
The code I'm using to create the request:
HttpWebRequest request; // assume the request has been created via WebRequest.Create with Method = "POST"
string boundary = "---------------------------" + DateTime.Now.Ticks.ToString("x");
request.ContentType = "multipart/form-data; boundary=" + boundary;
using (StreamWriter sw = new StreamWriter(request.GetRequestStream())) {
    sw.WriteLine("--" + boundary);
    sw.WriteLine("Content-Disposition: form-data; name=\"files\"; filename=\"" + Path.GetFileName(filePath) + "\"");
    sw.WriteLine(filePath.EndsWith(".pdf") ? "Content-Type: application/pdf" : "Content-Type: text/plain");
    sw.WriteLine();
    if (filePath.EndsWith(".pdf")) {
        // write PDF content into the request stream
    }
    else sw.WriteLine(File.ReadAllText(filePath));
    sw.Write("--" + boundary);
    sw.Write("--");
    sw.Flush();
}
For simple text files, this code works just fine. However, I have trouble uploading a PDF file.
Writing the file into the request body using StreamWriter.WriteLine with either File.ReadAllText or Encoding.UTF8.GetString(File.ReadAllBytes) results in the uploaded file being unreadable, because .NET replaces all the non-UTF-8 bytes with squares (which somehow also increases the file size by over 100 kB). Same result with UTF-7 and ANSI, though UTF-8 gives the closest match to the original file's contents.
Writing the file into the request body as binary data using either BinaryWriter or Stream.Write results in the web service rejecting it outright as invalid POST data. Content-Transfer-Encoding: binary (indicated by the documentation as necessary for application/http, hence why I tried) also causes rejection.
What alternative options are available? How can I encode a PDF without .NET silently replacing the invalid bytes with placeholder characters? Note that I have no control over what kind of content the web service accepts; if I did, I'd already have moved on to base64.
Problem solved, my bad. The multipart form header and the binary data were both correct but ended up in the wrong order, because I didn't Flush() the StreamWriter before writing the binary data into the request stream with Stream.CopyTo().
Moral of the story: if you're writing into the same Stream with more than one Writer at the same time, always Flush() before doing anything with the next Writer.
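For reference, a sketch of the corrected write order (same boundary and header setup as above; the PDF branch is what was fixed):
using (Stream rs = request.GetRequestStream())
using (StreamWriter sw = new StreamWriter(rs)) {
    sw.WriteLine("--" + boundary);
    sw.WriteLine("Content-Disposition: form-data; name=\"files\"; filename=\"" + Path.GetFileName(filePath) + "\"");
    sw.WriteLine("Content-Type: application/pdf");
    sw.WriteLine();
    sw.Flush(); // push the buffered headers into rs BEFORE writing binary data

    using (FileStream fs = File.OpenRead(filePath))
        fs.CopyTo(rs); // raw bytes, no Encoding involved

    sw.WriteLine();
    sw.Write("--" + boundary + "--");
    sw.Flush();
}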
Here is my sample code:
Code Snippet 1: This code executes on my file repository server and returns the file as an encoded string via the WCF service:
byte[] fileBytes = new byte[0];
using (FileStream stream = System.IO.File.OpenRead(@"D:\PDFFiles\Sample1.pdf"))
{
    fileBytes = new byte[stream.Length];
    stream.Read(fileBytes, 0, fileBytes.Length);
    stream.Close();
}
string retVal = System.Text.Encoding.Default.GetString(fileBytes); // fileBytes size is 209050
Code Snippet 2:
The client box, which requested the PDF file, receives the encoded string, converts it back to a PDF and saves it locally.
byte[] encodedBytes = System.Text.Encoding.Default.GetBytes(retVal); /// GETTING corrupted here
string pdfPath = @"C:\DemoPDF\Sample2.pdf";
using (FileStream fileStream = new FileStream(pdfPath, FileMode.Create)) // encodedBytes is 327279
{
    fileStream.Write(encodedBytes, 0, encodedBytes.Length);
    fileStream.Close();
}
The above code works absolutely fine on Framework 4.5 and 4.6.1.
When I use the same code in ASP.NET Core 2.0, it fails to convert to a byte array properly. I am not getting any runtime error, but the final PDF cannot be opened after it is created; it throws an error that the PDF file is corrupted.
I tried with Encoding.Unicode and Encoding.UTF8 also, but I get the same error for the final PDF.
Also, I have noticed that when I use Encoding.Unicode, at least the original byte array and the result byte array sizes are the same; with the other encoding types, the byte sizes are mismatched as well.
So, the question is: is System.Text.Encoding.Default.GetBytes broken in .NET Core 2.0?
I have edited my question for better understanding.
Sample1.pdf exists on a different server, which communicates over WCF to transmit the data to the client; the client stores the encoded file stream and converts it to Sample2.pdf.
Hopefully my question makes some sense now.
1: the number of times you should ever use Encoding.Default is essentially zero; there may be a hypothetical case, but if there is one: it is elusive
2: PDF files are not text, so trying to use an Encoding on them is just... wrong; you aren't "GETTING corrupted here" - it just isn't text.
You may wish to see Extracting text from PDFs in C# or Reading text from PDF in .NET
If you simply wish to copy the content without parsing it: File.Copy or Stream.CopyTo are good options.
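If the service contract really must pass a string, base64 is the lossless way to round-trip binary (a minimal sketch, reusing the paths from the question):
// Server side: read raw bytes and base64-encode them (no text Encoding involved).
byte[] fileBytes = File.ReadAllBytes(@"D:\PDFFiles\Sample1.pdf");
string retVal = Convert.ToBase64String(fileBytes);

// Client side: decode back to the identical bytes and save.
byte[] decoded = Convert.FromBase64String(retVal);
File.WriteAllBytes(@"C:\DemoPDF\Sample2.pdf", decoded);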
I have a PDF document which becomes encrypted. During encryption, a code is embedded at the end of the file stream before it is written to a file.
This PDF is later decrypted, and the details are viewable in any PDF viewer.
The issue is that the embedded code is then also visible in the decrypted PDF, and it needs removing.
I'm looking to decrypt the PDF document, remove the embedded code, then save it to a filename.
//Reading the PDF (cs is assumed to be the decryption stream, e.g. a CryptoStream)
Encoding enc = Encoding.GetEncoding("us-ascii");
byte[] buffer = new byte[4096];
string x = "";
int read;
while ((read = cs.Read(buffer, 0, buffer.Length)) > 0)
{
    x = x + enc.GetString(buffer, 0, read); // only decode the bytes actually read
}
//Remove the code
x = x.Replace("CODE", "");
//Write file
byte[] bytes = enc.GetBytes(x);
File.WriteAllBytes(filePath, bytes);
When the original file is generated, it appears to use a different encoding, because the first line of the original file reads %PDF-1.6%âãÏÓ while the decoded file reads %PDF-1.6 %????.
I have tried ascii, us-ascii, UTF8 and Unicode, but upon removal of the embedded CODE the file stopped opening due to corruption. Note that the embedded code sits in the raw file after the PDF %%EOF tag.
Has anyone any ideas?
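Since the code sits after the %%EOF tag, one option is to strip it at the byte level and never round-trip the binary body through a text Encoding at all. A minimal sketch, assuming the marker is the ASCII string "CODE" appended after %%EOF:
// Read the decrypted file as raw bytes; no Encoding touches the PDF body.
byte[] data = File.ReadAllBytes(filePath);
byte[] marker = Encoding.ASCII.GetBytes("CODE"); // placeholder marker from the question

// Search backwards, since the marker is expected near the end of the file.
int index = -1;
for (int i = data.Length - marker.Length; i >= 0; i--)
{
    bool match = true;
    for (int j = 0; j < marker.Length; j++)
        if (data[i + j] != marker[j]) { match = false; break; }
    if (match) { index = i; break; }
}

// Write back everything except the marker itself.
if (index >= 0)
{
    byte[] cleaned = new byte[data.Length - marker.Length];
    Array.Copy(data, 0, cleaned, 0, index);
    Array.Copy(data, index + marker.Length, cleaned, index, data.Length - index - marker.Length);
    File.WriteAllBytes(filePath, cleaned);
}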
I am uploading an .mp3 file via FTP code using C#. The file is uploaded successfully to the server, but when I bind it to a simple audio control or view it directly in the browser, it does not work as expected, whereas when I upload it manually to the server it works perfectly.
Code:
var inputStream = FileUpload1.PostedFile.InputStream;
byte[] fileBytes = new byte[inputStream.Length];
inputStream.Read(fileBytes, 0, fileBytes.Length);
Note: When I view the file in Firefox, it says the MIME type is not supported.
Thanks!
You're reading the file as a string then using UTF8 encoding to turn it into bytes. If you do that, and the file contains any binary sequence that doesn't code to a valid UTF8 value, parts of the data stream will simply get discarded.
Instead, read it directly as bytes. Don't bother with the StreamReader. Call the Read() method on the underlying stream. Example:
var inputStream = FileUpload1.PostedFile.InputStream;
byte[] fileBytes = new byte[inputStream.Length];
inputStream.Read(fileBytes, 0, fileBytes.Length);
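One caveat (an assumption worth guarding against, not something the code above handles): Stream.Read may return fewer bytes than requested in a single call. Copying through a MemoryStream sidesteps that:
byte[] fileBytes;
using (var ms = new MemoryStream())
{
    // CopyTo loops internally until the source stream is exhausted.
    FileUpload1.PostedFile.InputStream.CopyTo(ms);
    fileBytes = ms.ToArray();
}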
I posted a question related to this a while back but got no responses. Since then, I've discovered that the PDF is encoded using FlateDecode, and I was wondering if there is a way to manually decode the PDF in C# (Windows Phone 8)? I'm getting output like the following:
%PDF-1.5
%????
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
5 0 obj
<<
/Filter /FlateDecode
/Length 9
>>
stream x^+
The PDF has been created using the SyncFusion PDF controls for Windows Phone 8. Unfortunately, they do not currently have a text extraction feature, and I couldn't find that feature in other WP PDF controls either.
Basically, all I want is to download the PDF from OneDrive and read the PDF contents. Curious if this is easily doable?
private static string decompress(byte[] input)
{
    // Skip the 2-byte zlib header (e.g. 0x78 0x9C); DeflateStream expects a
    // raw deflate stream without it.
    byte[] cutinput = new byte[input.Length - 2];
    Array.Copy(input, 2, cutinput, 0, cutinput.Length);

    var stream = new MemoryStream();
    using (var compressStream = new MemoryStream(cutinput))
    using (var decompressor = new DeflateStream(compressStream, CompressionMode.Decompress))
        decompressor.CopyTo(stream);

    return Encoding.Default.GetString(stream.ToArray());
}
According to the similar question below, the first two bytes of the stream have to be cut off; this is done in the function above. Just pass all the bytes of the stream as input, and make sure the byte count matches the /Length specified in the stream dictionary.
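A hypothetical usage, assuming streamBytes already holds exactly the bytes between the stream and endstream keywords (the extraction helper below is made up for illustration):
byte[] streamBytes = GetObjectStreamBytes(); // hypothetical: the /Length bytes following the "stream" keyword
string decoded = decompress(streamBytes);
// decoded now holds the inflated stream contents, e.g. the page's content operators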
C# decode (decompress) Deflate data of PDF File
The easiest solution is to use the DeflateStream class provided by the .NET Framework; an example can be found in a similar thread. This approach might have some pitfalls.
If this doesn't work, there are libraries (like DotNetZip) capable of deflate stream decompression. Please check this link for a performance comparison.
The last possible option I see, without reinventing the wheel, is to use other PDF parsing libraries for stream decompression, or even for whole-PDF processing.