Extract Contents code from PDF / Word File

I have two big MS Word and PDF files which contain images, text fields, and tables.
I need to insert text into these files dynamically at specific locations. I've tried the Bookmarks method in Word, but I can't use that method now. I've read the data into a byte array and tried to write it to a PDF, but the file gets corrupted. Here is the code:
byte[] bytes = System.IO.File.ReadAllBytes("CDC.doc");
FileStream fs = new FileStream("CDC.pdf", FileMode.OpenOrCreate);
fs.Write(bytes, 0, bytes.Length);
fs.Close();
Is there any way to convert these PDF/Word files to get the PDF code for them, so that I can then append data at specific locations in that code? Please advise. Thanks!

If I understand you correctly, you want to write code that replaces the placeholders in a Word document, used as a template, with your application data. For placeholders you can use Bookmarks, but Content Controls are a better choice. You can use the Open XML SDK to parse such a template document and replace the Content Controls with data. This approach uses a free Microsoft library but is tedious.
A much easier approach is to use a ready-made library that works with templates whose placeholders are replaced with your real application data at runtime. In your C# application you prepare the data (as C# data objects or XML) and merge it with the template. The output can be in DOCX, PDF or XPS format. You can check out some of the examples here.
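To give a rough idea of what replacing Content Controls involves, here is a minimal sketch. Note that it does not use the Open XML SDK: to stay dependency-free it edits word/document.xml inside the .docx package directly with System.IO.Compression and LINQ to XML. The class name TemplateFiller, its method names, and the convention that each Content Control carries a Tag matching a key in your data are assumptions made for illustration only.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class TemplateFiller
{
    // Main WordprocessingML namespace used by word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Sets the text of every content control (w:sdt) whose w:tag value
    // is a key in 'values'. Returns the number of controls that were filled.
    public static int Fill(XDocument document, IDictionary<string, string> values)
    {
        int filled = 0;
        foreach (XElement sdt in document.Descendants(W + "sdt").ToList())
        {
            XElement tag = sdt.Element(W + "sdtPr")?.Element(W + "tag");
            string key = (string)tag?.Attribute(W + "val");
            if (key == null || !values.TryGetValue(key, out string value)) continue;

            List<XElement> texts =
                sdt.Element(W + "sdtContent")?.Descendants(W + "t").ToList();
            if (texts == null || texts.Count == 0) continue;

            // Put the value into the first text run and blank out the rest,
            // so the existing run formatting is kept.
            texts[0].Value = value;
            foreach (XElement extra in texts.Skip(1)) extra.Value = string.Empty;
            filled++;
        }
        return filled;
    }

    // Applies Fill to the main document part of a .docx file, in place.
    public static void FillDocx(string path, IDictionary<string, string> values)
    {
        using (ZipArchive archive = ZipFile.Open(path, ZipArchiveMode.Update))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            using (Stream stream = entry.Open())
            {
                XDocument document = XDocument.Load(stream, LoadOptions.PreserveWhitespace);
                Fill(document, values);
                stream.SetLength(0); // truncate before writing the updated XML back
                document.Save(stream, SaveOptions.DisableFormatting);
            }
        }
    }
}
```

A call might look like TemplateFiller.FillDocx("template.docx", new Dictionary<string, string> { { "CustomerName", "Alice" } }), where the file name and the tag are placeholders. The sketch only touches the main document part (not headers or footers) and only works on .docx files, so a binary .doc would have to be saved as .docx first.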

Related

Reading a part of PDF file in c#

I have many large PDF files, and I only need to read part of each one. I want to start reading the PDF file and write its content to another file, such as a txt file or any other type of file.
However, I want to limit the size of the file that I am writing to. When the txt file reaches about 15 MB, I should stop reading the PDF document and keep the created txt file for my purpose.
Can anyone help me do this in C#?
Thanks in advance for your help.
Here is the code that I use for reading the whole file (image content is not important to me):
using (StreamReader sr = new StreamReader(@"F:\1.pdf"))
{
    using (StreamWriter sw = new StreamWriter(@"F:\test.txt"))
    {
        while (!sr.EndOfStream)
        {
            string line = sr.ReadLine();
            sw.WriteLine(line);
            sw.Flush();
        }
    }
}

You have to use a PDF library to do this. There are a lot of free and paid PDF libraries out there that can do this task. Recently I used the EO.Pdf library to read a PDF page and extract the page content. The best part is that it has a NuGet package and is under continuous development. The downside is that you have to pay for commercial use.

PDF text can't be read directly with .NET's built-in stream readers. You should first convert the PDF to text (or XML, or HTML).
There are a lot of PDF libraries capable of converting PDF to text, such as iTextSharp (the most popular, and open source), and a lot of other tools.
To control the size of the output text files you should:
Get the number of pages from the PDF.
Run the PDF-to-text conversion page by page, checking the output text file size as you go.
Once the file size is over 15 MB, stop the conversion and move on to the next file.
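The loop described in these steps could be sketched roughly as follows. To keep the example independent of any particular PDF library, the per-page text extraction is passed in as a delegate: extractPage, the class name CappedTextExport and the method name Export are all hypothetical. One small deviation from the steps above: this version stops before the limit would be exceeded rather than just after, so the output file never grows past maxBytes.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of any library.
public static class CappedTextExport
{
    // Writes the text of pages 1..pageCount to 'outputPath' and stops as soon as
    // adding the next page would push the file past 'maxBytes'.
    // 'extractPage' is a placeholder hook: plug in your PDF library's
    // per-page text extraction (page numbers are 1-based here).
    // Returns the number of pages that were written.
    public static int Export(int pageCount, Func<int, string> extractPage,
                             string outputPath, long maxBytes)
    {
        var encoding = new UTF8Encoding(false); // UTF-8 without a byte order mark
        long bytesWritten = 0;
        int pagesWritten = 0;

        using (var writer = new StreamWriter(outputPath, false, encoding))
        {
            for (int page = 1; page <= pageCount; page++)
            {
                string text = extractPage(page) + Environment.NewLine;
                long size = encoding.GetByteCount(text);
                if (bytesWritten + size > maxBytes) break; // size limit reached

                writer.Write(text);
                bytesWritten += size;
                pagesWritten++;
            }
        }
        return pagesWritten;
    }
}
```

For a 15 MB cap you would pass 15L * 1024 * 1024 as maxBytes. With iTextSharp 5.x, for example, the page count would come from PdfReader.NumberOfPages and the delegate could wrap PdfTextExtractor.GetTextFromPage(reader, page); check those names against the version of the library you actually use.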

How to create a text file from an existing Pdf document in C#.net

I have a PDF document whose data is laid out in a table structure, and I would like to convert that PDF file into a text file with the same structure, keeping the margins and the spaces between the text as they are in the PDF.

You need to write your own PDF tool then, which is not exactly an easy task. Honestly, third-party tools make your job much easier, so why don't you want to use one?
If you change your mind, I can suggest iTextSharp. I've used it in the past with great success. Here is an example to get you going:
http://www.codeproject.com/Articles/12445/Converting-PDF-to-Text-in-C
P.S. There are three tools used in there.

Extracting embedded XML File from PDF A/3 using abcpdf in C# - ZUGFeRD

I'm currently working with the new German ZUGFeRD files. These are PDF/A-3 files that have an embedded XML file in them which contains data.
I want to extract this XML file from the PDF/A-3 using abcpdf 8.1 with C#.
Any idea how to do this?
Thanks a lot and regards,

I don't know abcpdf, but I guess that PDF libraries offer similar access to a PDF's content.
First take a look at Das-ZUGFeRD-Format_1p0.pdf, especially page 112. The image there shows the object tree you have to traverse in order to find the XML stream.
The tree gives you the names, the types and the direction. Now you can traverse the PDF object tree to get to the XML content that you are looking for.
The steps, based on the diagram:
Read your PDF
Get the catalog inside your PDF
Get the array named AF from the catalog
Get the first element from the AF array (it should be a file spec)
From the file spec, get the dictionary named EF
Get the content of the stream that EF refers to (its F entry)
These are the steps you need to perform in order to get to the content.
To display the structure of a PDF and browse the tree, I would recommend using a tool like iText RUPS.

What I did with abcpdf:
Get the ObjectSoup array from the Doc (pretty much an array of all objects in the Doc)
As ZUGFeRD allows only one embedded file inside the PDF, I just searched this ObjectSoup array for the one object of type StreamObject that contains /EmbeddedFile
Decompress the stream of that object, get the byte[] of the stream and write it into an XML file

Search and Replace PDF using Itext

I need to generate a PDF based on some user inputs. The output PDF has some images, tables and text. I think that iText is not user-friendly for generating this report programmatically.
Since the report I need to generate is quite complicated, I was wondering if it is possible to create a template PDF and then load it, search it and replace the strings/images I want.
The template PDF can be a tagged PDF.
Is it possible to do that?
Is it the best approach?
EDIT: I'm using WPF + MVVM + .NET 3.5

Replacing text within a PDF file is not simple. The PDF file format uses a cross-reference table near the end of the file in which objects are listed with their byte offsets within the file. Some elements, such as streams, also have a field that gives their own length in bytes. If these offsets and lengths no longer match after an edit, the reader will probably report a broken PDF.
You should have a look at reporting, as it is made for these tasks:
http://msdn.microsoft.com/en-us/library/bb885185%28v=vs.100%29.aspx
You can create a template with the report designer, set your data and export it to pdf.

Reading Data From PDF File And Write It Into Word File?

How can we read data from a PDF file and write it to a Word file using ASP.NET C# code?

You can use the IFilter capabilities built into Windows. Here's an article with some example code:
Using-IFilter-in-C
The issue with PDF files is that even if you're able to extract the plaintext of the PDF in readable form (which is not a guarantee by any stretch), the text will be completely unformatted. Even simple things like line breaks will be lost in many cases.
