How can I combine multiple PDFs into one PDF without a 3rd party component?
I don't think you can.
The open-source component PDFsharp has that functionality, and it comes with a nice source code sample on combining files.
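For anyone who wants a starting point, here is a minimal sketch of merging with PDFsharp. It assumes the PDFsharp package is referenced; the class name `PdfMergeSketch` and the method `Merge` are made up for illustration, and the input files are assumed to be unencrypted.

```csharp
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

// Hypothetical helper, not part of PDFsharp itself.
static class PdfMergeSketch
{
    // Appends every page of each input file, in order, to a new output document.
    public static void Merge(string[] inputPaths, string outputPath)
    {
        PdfDocument output = new PdfDocument();
        foreach (string path in inputPaths)
        {
            // Import mode is needed to copy pages into another document.
            PdfDocument input = PdfReader.Open(path, PdfDocumentOpenMode.Import);
            for (int i = 0; i < input.PageCount; i++)
            {
                output.AddPage(input.Pages[i]);
            }
        }
        output.Save(outputPath);
    }
}
```

It could be called as `PdfMergeSketch.Merge(new[] { @"C:/pdf/1.pdf", @"C:/pdf/2.pdf" }, @"C:/pdf/new.pdf");`.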
The .NET Framework does not contain the ability to modify/create PDFs. You need a 3rd party component to accomplish what you are looking for.
As others have said, there is nothing built in to do that task. Use iTextSharp with this example code.
AFAIK C# has no built-in support for handling PDF, so what you are asking cannot be done without using a 3rd party component or a COTS library.
Regarding libraries, there are many possibilities. Just to point out a few:
http://csharp-source.net/open-source/pdf-libraries
http://www.codeproject.com/KB/graphics/giospdfnetlibrary.aspx
http://www.pdftron.com/net/index.html
I don't think the .NET Framework contains any such libraries. I used iTextSharp with C# to combine PDF files, and I think it is the easiest way to do this. Here is the code I used.
string[] lstFiles=new string[3];
lstFiles[0]=@"C:/pdf/1.pdf";
lstFiles[1]=@"C:/pdf/2.pdf";
lstFiles[2]=@"C:/pdf/3.pdf";
PdfReader reader = null;
Document sourceDocument = null;
PdfCopy pdfCopyProvider = null;
PdfImportedPage importedPage;
string outputPdfPath=@"C:/pdf/new.pdf";
sourceDocument = new Document();
pdfCopyProvider = new PdfCopy(sourceDocument, new System.IO.FileStream(outputPdfPath, System.IO.FileMode.Create));
//Open the output file
sourceDocument.Open();
try
{
//Loop through the files list
for (int f = 0; f < lstFiles.Length; f++)
{
reader = new PdfReader(lstFiles[f]);
//Ask the reader for the page count instead of scanning the raw file
int pages = reader.NumberOfPages;
//Add pages of current file
for (int i = 1; i <= pages; i++)
{
importedPage = pdfCopyProvider.GetImportedPage(reader, i);
pdfCopyProvider.AddPage(importedPage);
}
reader.Close();
}
//At the end save the output file
sourceDocument.Close();
}
catch (Exception ex)
{
throw; //Rethrow without resetting the stack trace
}
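//Counts pages by scanning the raw file for page objects. This can miss pages in PDFs that use compressed object streams, so reader.NumberOfPages is more reliable.
private int get_pageCcount(string file)
{
using (StreamReader sr = new StreamReader(File.OpenRead(file)))
{
Regex regex = new Regex(@"/Type\s*/Page[^s]");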
MatchCollection matches = regex.Matches(sr.ReadToEnd());
return matches.Count;
}
}
iTextSharp is the way to go.
Although it has already been said, you can't manipulate PDFs with the built-in libraries of the .NET Framework. I can however recommend iTextSharp, which is a .NET port of the Java iText. I have played around with it, and found it to be a very easy tool to use.
Related
I'm parsing a PDF file that contains Japanese characters using iText 7 in C#, like so:
public static string ExtractTextFromPDF(string filePath)
{
var pdfReader = new PdfReader(filePath);
var pdfDoc = new PdfDocument(pdfReader);
var sb = new StringBuilder();
for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
{
var strategy = new SimpleTextExtractionStrategy();
sb.Append(PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy));
}
pdfDoc.Close();
pdfReader.Close();
return sb.ToString();
}
But I run into the exception:
iText.IO.IOException: 'The CMap iText.IO.Font.Cmap.UniJIS-UTF16-H was not found.'
I've searched around for a solution on how to add this but I haven't come up with anything that works for the Japanese characters. If there is any other library more suited that would also be ok. Any help?
Thanks
Encoding CMaps, in particular those for CJK scripts, are in a separate package.
For .NET, use itext7.font-asian via NuGet.
For Java, use com.itextpdf:font-asian via Maven.
The existence of this package is more visible for the Java version than for the .Net version.
The company would like to use iTextSharp version 4.1.6 specifically and doesn't want to buy a license (version 5/7).
We had already implemented text extraction from PDF using iTextSharp version 5. Since we downgraded, that method is no longer available, because the 4.1.6 LGPL version does not support it.
I looked through many Stack Overflow questions and other sites for an answer. It looks like there is no custom implementation other than the code below, which exists only in the AGPL version.
PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy())
And byte[] pageContent = reader.GetPageContent(i); gives the byte content, but converting it to a string does not give us the exact text of the file.
As we do not wish to buy a license for the AGPL version and still need to implement text extraction from PDF, does any other tool support this, or does anybody have an implementation of a text extractor?
Any suggestions would be greatly appreciated.
Edit: Reference for @jgoday's answer:
With iText 4.1 you can use PdfContentParser (https://github.com/schourode/iTextSharp-LGPL/blob/f75cdad88236d502af42458a420d48be2a47008f/src/core/iTextSharp/text/pdf/PdfContentParser.cs) to parse the contents of every page.
using System;
using System.Text;
using iTextSharp.text.pdf;
namespace PdfExtractor
{
class Program
{
static void Main(string[] args)
{
var reader = new PdfReader(@"D:\Tmp\sample.pdf");
try
{
var parser = new PdfContentParser(new PRTokeniser(reader.GetPageContent(2)));
var sb = new StringBuilder();
while (parser.Tokeniser.NextToken())
{
if (parser.Tokeniser.TokenType == PRTokeniser.TK_STRING)
{
string str = parser.Tokeniser.StringValue;
sb.Append(str);
}
}
Console.WriteLine(sb.ToString());
}
finally {
reader.Close();
}
}
}
}
I am trying to create a templating system with OpenXML in our Azure app service-based application (so no Interop) and am running into issues with getting it to work. Here is the code I am currently working with (contents is a byte array):
using(MemoryStream stream = new MemoryStream(contents))
{
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(stream, true))
{
string docText = null;
using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
{
docText = sr.ReadToEnd();
}
Regex regexText = new Regex("<< Company.Name >>");
docText = regexText.Replace(docText, "Company 123");
using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
{
sw.Write(docText);
}
wordDoc.Save();
}
updated = stream.ToArray();
}
The search text is not being found/replaced, which I am assuming is because of the way everything is stored separately in the XML, but how would I go about replacing a field like this?
Thanks
Ryan
With the OpenXML SDK you can use the approach from this SearchAndReplace screencast on YouTube, which shows an algorithm that can perform the replacement across multiple <w:run> elements.
An alternative approach would be to use a pure .NET solution, like this one: Find and Replace text in a Word document.
Lastly, the easiest and most straightforward approach would be to use some other library; for instance, check this example of Find and Replace with GemBox.Document.
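To illustrate the multi-run idea in code, here is a rough sketch using the OpenXML SDK. It works on the text of each paragraph rather than on the raw XML, which also sidesteps the fact that `<` and `>` are stored as `&lt;` and `&gt;` in the underlying markup. The class `PlaceholderSketch` and its `Replace` method are hypothetical names. As a simplification, the replaced text is written into the first run of the paragraph, so any formatting carried by the later runs of a matching paragraph is lost.

```csharp
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper, not part of the OpenXML SDK.
static class PlaceholderSketch
{
    public static byte[] Replace(byte[] contents, string placeholder, string value)
    {
        using (var stream = new MemoryStream())
        {
            // Copy into an expandable stream; new MemoryStream(byte[]) cannot grow.
            stream.Write(contents, 0, contents.Length);
            using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(stream, true))
            {
                Body body = wordDoc.MainDocumentPart.Document.Body;
                foreach (Paragraph paragraph in body.Descendants<Paragraph>())
                {
                    var texts = paragraph.Descendants<Text>().ToList();
                    if (texts.Count == 0) continue;

                    // Join the run texts so a placeholder split across runs is found.
                    string joined = string.Concat(texts.Select(t => t.Text));
                    if (!joined.Contains(placeholder)) continue;

                    // Put the replaced text in the first text element and empty the rest.
                    texts[0].Text = joined.Replace(placeholder, value);
                    texts[0].Space = SpaceProcessingModeValues.Preserve;
                    foreach (Text t in texts.Skip(1)) t.Text = string.Empty;
                }
                wordDoc.MainDocumentPart.Document.Save();
            }
            return stream.ToArray();
        }
    }
}
```

With the question's variables this would be called as `updated = PlaceholderSketch.Replace(contents, "<< Company.Name >>", "Company 123");`.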
Below is the code that creates a PDF file to write to. Every time I call it, it creates a new PDF file to write into. My question is: is there an equivalent method for exporting to Word, or, for simplicity, one that just creates a blank .doc file so that I can export data into it?
public void showPDf() {
iTextSharp.text.Document doc = new iTextSharp.text.Document(
iTextSharp.text.PageSize.A4);
string combined = Path.Combine(txtPath.Text,".pdf");
PdfWriter pw = PdfWriter.GetInstance(doc, new FileStream(combined, FileMode.Create));
doc.Open();
}
1. Interop API
It is available in the Microsoft.Office.Interop.Word namespace.
You can use the Word Interop COM API to do that, using the following code:
// Open a doc file.
Application application = new Application();
Document document = application.Documents.Open("C:\\word.doc");
// Loop through all words in the document.
int count = document.Words.Count;
for (int i = 1; i <= count; i++)
{
// Write the word.
string text = document.Words[i].Text;
Console.WriteLine("Word {0} = {1}", i, text);
}
// Close word.
application.Quit();
The only drawback is that you must have Office installed to use this feature.
2. OpenXML
You can use OpenXML to build Word documents; try the following link:
http://msdn.microsoft.com/en-us/library/bb264572(v=office.12).aspx
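As a rough idea of what the OpenXML route looks like, here is a minimal sketch that creates a new document containing a single paragraph. It assumes the DocumentFormat.OpenXml package is referenced; `BlankDocxSketch` and `Create` are hypothetical names. Note that this produces a .docx file, not the legacy binary .doc format.

```csharp
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper, not part of the OpenXML SDK.
static class BlankDocxSketch
{
    // Creates a .docx file at the given path with one paragraph of text.
    public static void Create(string path, string firstLine)
    {
        using (WordprocessingDocument wordDoc =
            WordprocessingDocument.Create(path, WordprocessingDocumentType.Document))
        {
            // Every Word document needs a main document part with a body.
            MainDocumentPart mainPart = wordDoc.AddMainDocumentPart();
            mainPart.Document = new Document(
                new Body(new Paragraph(new Run(new Text(firstLine)))));
            mainPart.Document.Save();
        }
    }
}
```

Passing an empty string as `firstLine` gives an effectively blank document that data can be appended to later.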
Did you try searching the web for this?
How to automate Microsoft Word to create a new document by using Visual C#
There is a free solution for exporting data to Word:
http://www.codeproject.com/Articles/151789/Export-Data-to-Excel-Word-PDF-without-Automation-f
I'm using iTextSharp in VB.NET to get the text content from a PDF file. The solution works fine for some files but not for others, even quite simple ones. The problem is that the token's StringValue is set to null (displayed as a set of empty square boxes).
token = New iTextSharp.text.pdf.PRTokeniser(pageBytes)
While token.NextToken()
tknType = token.TokenType()
tknValue = token.StringValue
I can measure the length of the content, but I cannot get the actual string content.
I realized that this depends on the font of the PDF. If I create a PDF using either Acrobat or PdfCreator with Courier (which, by the way, is the default font in my Visual Studio editor), I can get all the text content. If the same PDF is built using a different font, I get the empty square boxes.
Now the question is: how can I extract text regardless of the font setting?
Thanks
Complementary to Mark's answer, which helped me a lot: the iTextSharp namespaces and classes are a bit different from the Java version.
public static string GetTextFromAllPages(String pdfPath)
{
PdfReader reader = new PdfReader(pdfPath);
StringWriter output = new StringWriter();
for (int i = 1; i <= reader.NumberOfPages; i++)
output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));
reader.Close();
return output.ToString();
}
Check out PdfTextExtractor.
String pageText =
PdfTextExtractor.getTextFromPage(myReader, pageNum);
or
String pageText =
PdfTextExtractor.getTextFromPage(myReader, pageNum, new LocationTextExtractionStrategy());
Both require fairly recent versions of iText[Sharp]. Actually parsing the content stream yourself is just reinventing the wheel at this point. Spare yourself some pain and let iText do it for you.
PdfTextExtractor will handle all the different font/encoding issues for you... all the ones that can be handled anyway. If you can't copy/paste from Reader accurately, then there's not enough information present in the PDF to get character information from the content stream.
Here is a variant using iTextSharp.text.pdf.PdfName.ANNOTS and iTextSharp.text.pdf.PdfName.CONTENTS, if someone needs it.
string strFile = @"C:\my\path\tothefile.pdf";
iTextSharp.text.pdf.PdfReader pdfRida = new iTextSharp.text.pdf.PdfReader(strFile);
iTextSharp.text.pdf.PRTokeniser prtTokeneiser;
int pageFrom = 1;
int pageTo = pdfRida.NumberOfPages;
iTextSharp.text.pdf.PRTokeniser.TokType tkntype ;
string tknValue;
for (int i = pageFrom; i <= pageTo; i++)
{
iTextSharp.text.pdf.PdfDictionary cpage = pdfRida.GetPageN(i);
iTextSharp.text.pdf.PdfArray cannots = cpage.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
if(cannots!=null)
foreach (iTextSharp.text.pdf.PdfObject oAnnot in cannots.ArrayList)
{
iTextSharp.text.pdf.PdfDictionary cAnnotationDictironary = (iTextSharp.text.pdf.PdfDictionary)pdfRida.GetPdfObject(((iTextSharp.text.pdf.PRIndirectReference)oAnnot).Number);
iTextSharp.text.pdf.PdfObject annotContents = cAnnotationDictironary.Get(iTextSharp.text.pdf.PdfName.CONTENTS);
if (annotContents != null && annotContents.GetType() == typeof(iTextSharp.text.pdf.PdfString))
{
string cStringVal = ((iTextSharp.text.pdf.PdfString)annotContents).ToString();
if (cStringVal.ToUpper().Contains("LOS 8"))
{ // DO SOMETHING FUN
}
}
}
}
pdfRida.Close();