Reading part of a PDF file in C#

I have many large PDF files, and I only need to read part of each one. I want to start reading a PDF file and write its content to another file, such as a .txt file or any other type of file.
However, I want to limit the size of the file I am writing to. When the .txt file reaches about 15 MB, I should stop reading the PDF document and keep the .txt file created so far.
Can anyone help me do this in C#?
Thanks in advance for your help.
Here is the code that I use for reading the whole file (image content is not important to me):
using (StreamReader sr = new StreamReader(@"F:\1.pdf"))
{
    using (StreamWriter sw = new StreamWriter(@"F:\test.txt"))
    {
        while (!sr.EndOfStream)
        {
            string line = sr.ReadLine();
            sw.WriteLine(line);
            sw.Flush();
        }
    }
}

You have to use a PDF library to do this. There are a lot of free and paid PDF libraries out there that can handle this task. Recently I used the EO.Pdf library to read PDF pages and extract page content. The best part is that it has a NuGet package and is actively developed. The downside is that you have to pay for commercial use.

A PDF can't be read as plain text with the built-in .NET classes. You first have to convert the PDF to text (or XML, or HTML).
There are a lot of PDF libraries capable of converting PDF to text, such as iTextSharp (the most popular, and open source), as well as many other tools.
To control the size of the output text files you should:
1. Get the number of pages from the PDF.
2. Run the PDF-to-text conversion page by page, checking the output text file size as you go.
3. Once the file size is over 15 MB, stop the conversion and move on to the next file.
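The page-by-page loop described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `CappedTextExport`, `WritePagesUntilLimit`, and the `extractPageText` delegate are hypothetical names, and the delegate stands in for whatever per-page text extraction call your chosen PDF library provides. The sketch also stops just before the limit would be exceeded rather than just after it, so the output never grows past the cap.

```csharp
using System;
using System.IO;
using System.Text;

public static class CappedTextExport
{
    // Writes the text of pages 1..pageCount to outputPath and stops before the
    // file would exceed maxBytes. Returns the number of pages written.
    // pageCount and extractPageText are assumed to come from your PDF library.
    public static int WritePagesUntilLimit(
        int pageCount,
        Func<int, string> extractPageText,
        string outputPath,
        long maxBytes)
    {
        // UTF-8 without a byte order mark, so the byte count matches the file size.
        var encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
        long bytesWritten = 0;
        int pagesWritten = 0;

        using (var writer = new StreamWriter(outputPath, false, encoding))
        {
            for (int page = 1; page <= pageCount; page++)
            {
                string text = extractPageText(page) + Environment.NewLine;
                long pageBytes = encoding.GetByteCount(text);

                // Stop as soon as the next page would push the file over the cap.
                if (bytesWritten + pageBytes > maxBytes)
                {
                    break;
                }

                writer.Write(text);
                bytesWritten += pageBytes;
                pagesWritten++;
            }
        }

        return pagesWritten;
    }
}
```

For a 15 MB cap you would pass `15L * 1024 * 1024` as `maxBytes`. Counting encoded bytes in memory avoids flushing and re-checking the file length after every page.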

Related

Unable to convert PCL file to PDF using Filestream and PDFSharp

I was unable to find a free library that can directly convert a PCL file to a PDF file, so I thought of reading the PCL file into a FileStream and saving it to a PDF document using PDFsharp.
I tried the following code, but it gives me a blank PDF document.
Can someone tell me what I'm doing wrong?
private void pdfSharpPclToPDF(string localPCLPath)
{
    using (FileStream fleStream = new FileStream(localPCLPath, FileMode.Open))
    {
        using (PdfDocument newPDF = new PdfDocument(fleStream))
        {
            PdfPage pdfPage = new PdfPage(newPDF);
            newPDF.Pages.Add(pdfPage);
            newPDF.Save("D:\\Research\\PDF_Files\\output.pdf");
        }
    }
}
It would also help if someone could suggest another open source library that can do the job for me.
PDFsharp does not read from the FileStream you create. If you invoke Save() without a file name, PDFsharp will save the PDF document to the stream. If you invoke Save(<filename>), the document will be saved in that file.
PDFsharp cannot read PCL files. You are trying something that cannot work with PDFsharp.
Most applications include some way to export content as printer output. The higher-end output formats are generically referred to as PDLs (Page Description Languages), and historically they were stored as filename.prn without distinguishing the content.
The contents of a PRN file could traditionally be escape codes as used by Epson, PostScript programs, or PCL (Printer Control Language), plus many others. Nowadays the list often includes PDF, formerly seen as the dumber format, which high-end printers accept directly.
So a PDFsharp document can be exported to PCL via a printer driver, but PDFsharp is not designed to import PCL for conversion into PDF output.
One application that can convert in both directions between PDL formats is Artifex GhostPDL (it does NOT import Epson escape codes but can export them). Ghostscript is open source, as you requested, but note the licensing: it is available under the AGPL, or under a commercial license from Artifex. It is the most capable option, with a few decades of history.

Extract Contents code from PDF / Word File

I have two big files, one MS Word and one PDF, which contain images, text fields, and tables.
I need to insert text into these files dynamically at specific locations. I've tried the Bookmarks method in Word, but I can't use that method now. I've read the data into a byte array and tried to write it out as a PDF, but the file gets corrupted. Here is the code:
byte[] bytes = System.IO.File.ReadAllBytes("CDC.doc");
FileStream fs = new FileStream("CDC.pdf", FileMode.OpenOrCreate);
fs.Write(bytes, 0, bytes.Length);
fs.Close();
Is there any way to convert these PDF/Word files so that I get the PDF code for them, and can then append data at specific locations in that code? Please advise. Thanks!
If I understand you right, you would like to write code that replaces all placeholders in a Word document acting as a template with your application data. For placeholders you can use Bookmarks, but a better choice would be Content Controls. You can use the Open XML SDK to parse such a template Word document and replace the Content Controls with data. This approach uses a free MS library but is tedious.
A much easier approach would be to use a ready-made library that works with templates containing placeholders, which get replaced with your real app data at runtime. In your C# application you prepare the data (as C# data objects or XML) and merge this data with the template. Output can be in docx, pdf or xps format. You can check out some of the examples here.

How to compress PDF file that is generated using HTML-Renderer library c#

I have the following code, which is working fine. I am generating a PDF file from HTML content using the HTML-Renderer library. The problem is that the generated PDF file sometimes grows to approximately 8 MB, and I want to compress it because I have to send it to some of my clients. I have searched but did not find a solution.
Note: "The HTML content has no images, just tags and textual data."
Library:
HTML Renderer
Question: How can I compress the PDF file using HTML Renderer? Is there a method for this?
Code:
public void GenerateHtmlToPdf(string htmlContent, string sFilePath)
{
    PdfDocument pdf = PdfGenerator.GeneratePdf(htmlContent, PageSize.A4, 25);
    pdf.Save(sFilePath);
}
Thanks!
The PDF "foo.pdf" you provided contains 33 pages, and each page is between 150 kB and 350 kB in size.
Each page draws hundreds of lines. Maybe the library draws four lines around each cell. The same visual effect could be achieved with a few long lines, which would reduce the file size considerably.
There are many PDF optimizers and PDF compactors around, but I'm afraid they won't do much good with that file.
It seems you can't do it with PDFsharp. And because you don't use images, image compression won't help either. You may read up on alternatives in another post: Compress existing PDF using C# programming

How can I read any file into a string

I want to be able to read any file into a string, for instance the way Notepad might open a Word file. Using the following code:
StreamReader sr = new StreamReader(filePath);
text += sr.ReadToEnd();
sr.Close();
works fine on a basic text file, but when I use it on, say, a Word file I just get a few odd characters, whereas opening the same file in Notepad shows me the entire file: text, special characters, etc. I'm using this as part of a file drop into a textbox. Basically I'm looking to get the same output you would get when you open any file in Notepad. What should I be using instead?
Using the code from your question to open a file does read the entire stream (you can see this in the debugger). The problem is that most of these binary files contain null characters (the \0 char), which cause most viewers to stop displaying the contents at that point.
If you remove or escape the '\0' characters, you'll see the entire stream just like in Notepad.
For example:
string filePath = @"c:\windows\system32\calc.exe";
StreamReader sr = new StreamReader(filePath);
string text = sr.ReadToEnd();
sr.Close();
textBox1.Text = text.Replace('\0', ' ');
Add a TextBox named textBox1 to a form and see for yourself. You'll see the entire stream.
This should give you the functionality that you want. First read the file in as a byte[] using
byte[] data = File.ReadAllBytes(fileName);
then just decode it as ASCII, or whichever encoding suits the data.
string s = Encoding.ASCII.GetString(data);
I'm assuming you're referring to WordPad, which is also included with Windows, rather than Notepad. WordPad, in addition to showing basic text files, also knows how to parse and edit Word files (.DOCX, but oddly enough not the older .DOC files), Rich Text Format files (.RTF), and OpenOffice documents (.ODT). This doesn't come for free just by opening the Word file and displaying its content: there is a lot of code inside WordPad to parse this binary data and display it properly, not to mention the code to edit and save it again.
If you need to retrieve the data from Word files, there are several programmatic options, starting with automating the Word application itself using the Word APIs. However, this solution is problematic on a server, or anywhere Word is not installed.
In that case you also have several options. For documents from Word 2007 onward with the .DOCX extension, you can use the System.IO.Packaging namespace to open the DOCX and extract its relevant parts, but it's up to you to understand the syntax of the XML files within. Alternatively, you can purchase a third-party library that does it for you, such as Aspose, which I've worked with and found to be fine. There are others out there too.
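As a rough illustration of the "open the DOCX and read its XML parts" idea, here is a minimal sketch. Note the swap: it uses `System.IO.Compression.ZipArchive` instead of `System.IO.Packaging`, which works because a .DOCX file is a ZIP package. It assumes the main document part lives at `word/document.xml` (the usual location, though the package relationships are what formally define it), and the `DocxText` class and `ExtractParagraphs` method are hypothetical names.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxText
{
    // WordprocessingML main namespace used by elements such as w:p and w:t.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the plain text of a .docx stream, one line per paragraph.
    // Assumes the main part is stored at word/document.xml.
    public static string ExtractParagraphs(Stream docxStream)
    {
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("No word/document.xml part found.");
            }

            using (Stream part = entry.Open())
            {
                XDocument xml = XDocument.Load(part);

                // Each w:p is a paragraph; its visible text is split across w:t runs.
                var paragraphs = xml.Descendants(W + "p")
                    .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)));

                return string.Join(Environment.NewLine, paragraphs);
            }
        }
    }
}
```

You could call it with `DocxText.ExtractParagraphs(File.OpenRead(path))`. The sketch ignores formatting, headers, footers, and footnotes, and text inside text boxes nested in a paragraph may appear twice, so treat it as a starting point rather than a full parser.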

Reading Data From PDF File And Write It Into Word File?

How can we read data from a PDF file and write it to a Word file using ASP.NET C# code?
You can use the IFilter capabilities built into Windows. Here's an article with some example code:
Using IFilter in C#
The issue with PDF files is that even if you're able to extract the plaintext of the PDF in readable form (which is not a guarantee by any stretch), the text will be completely unformatted. Even simple things like line breaks will be lost in many cases.
