PDF files with XML files attached - c#

HI All,
I have a PDF file with a xml attached, i need to parse the xml file. Does anyone knows how i do that?
I´m using C#.
Thanks in advance.

I believe this blog post describing how read from a PDF file using C# is what you want.
This is the example he gives of grabbing text from the PDF:
using System;
using org.pdfbox.pdmodel;
using org.pdfbox.util;
namespace PDFReader
{
class Program
{
static void Main(string[] args)
{
PDDocument doc = PDDocument.load("lopreacamasa.pdf");
PDFTextStripper pdfStripper = new PDFTextStripper();
Console.Write(pdfStripper.getText(doc));
}
}
}
Here is what looks like an exhaustive and highly organized list of how to read PDFs with C#.
If what you need is some form of embedded meta data, as Mark suggested, I'm sure it's also possible with the to fetch using the tools I've linked to.

Try using LINQ to XML as suggested in this question.

PDF files can have a meta data information object or is it an XML file embedded as an object?

Related

How extract image from DOC file without using of Microsoft.Office.Interop.Word

I'm trying to extract image from *.doc file without using of Microsoft.Office.Interop.Word. I found library like FreeSpire.Doc, but It seems the free version of library isn't able to extract images. Can someone help me with this problem?
[Attached *.doc file with the image I need][1]
[1]: https://mega.nz/#!5nITyQzT!aesEA0akirlpKSEEDceNDjifOAFKlNZSmgTwfhFm36M
Thank you
The only library i found which can extract Images from .doc Document is Aspose. There is an example in their documentation how you can Export Images.
I ended up with this solution. I found a library named GemBox.Document. Unluckily this library is free only for documents containing up to 20 paragraphs. So I had to remove extra paragraphs and then I used this code to get first picture in document.
public void CreateSubnestImageFromNestingReport(string picturePath,string docPath)
{
var fileDir = Path.GetDirectoryName(picturePath);
Directory.CreateDirectory(fileDir);
ComponentInfo.SetLicense("FREE-LIMITED-KEY");
var document = DocumentModel.Load(docPath, LoadOptions.DocDefault);
var pict = document.GetChildElements(true).Single(el => el.ElementType == ElementType.Picture) as Picture;
File.WriteAllBytes(picturePath, pict.PictureStream.ToArray());
}

Generating a DOC or DOCX using MigraDoc

I am working on a project where I need to create a Word file. For this purpose, I am using MigraDoc library for C#.
Using this library, I am easily able to generate a RTF file by writing :
Document document = CreateDocument();
RtfDocumentRenderer rtf = new RtfDocumentRenderer();
rtf.Render(document, "test.rtf", null);
Process.Start("test.rtf");
But the requirement now asks me to get a DOC or DOCX file rather than a RTF file. Is there a way to generate a DOC or DOCX file using MigraDoc? And if so, how may I achieve this?
MigraDoc cannot generate DOC or DOCX files. Since MigraDoc is open source, you could add a renderer for DOCX if you have the knowledge and the time.
MigraDoc as it is cannot generate DOC/DOCX, but maybe you can invoke an external conversion tool after generating the RTF file.
I don't know any such tools. Word can open RTF quickly and so far our customers never complained about getting RTF, not DOC or DOCX.
Update (2019-07-29): The website mentions "Word", but this only refers to RTF. There never was an implementation for .DOC or .DOCX.
It seems no any MigraDoc renders that support DOC or DOCX formats.
On documentation page we can see one MigraDoc feature:
Supports different output formats (PDF, Word, HTML, any printer supported by Windows)
But seems documentation says about RTF format that perfectly works with Word. I have reviewed MigraDoc repository and I do not see any DOC renders. We can use only RTF converter for Word supporting. So we can't generate DOC file directly using this package.
But we can convert RTF to DOC or DOCX easily (and for free) using FreeSpire.Doc nuget package.
Full code example is here:
using MigraDoc.DocumentObjectModel;
using MigraDoc.RtfRendering;
using Spire.Doc;
using System.IO;
namespace MigraDocTest
{
class Program
{
static void Main(string[] args)
{
using (var stream = new MemoryStream())
{
// Generate RTF (using MigraDoc)
var migraDoc = new MigraDoc.DocumentObjectModel.Document();
var section = migraDoc.AddSection();
var paragraph = section.AddParagraph();
paragraph.AddFormattedText("Hello World!", TextFormat.Bold);
var rtfDocumentRenderer = new RtfDocumentRenderer();
rtfDocumentRenderer.Render(migraDoc, stream, false, null);
// Convert RTF to DOCX (using Spire.Doc)
var spireDoc = new Spire.Doc.Document();
spireDoc.LoadFromStream(stream, FileFormat.Auto);
spireDoc.SaveToFile("D:\\example.docx", FileFormat.Docx );
}
}
}
}
You can use Microsoft's DocumentFormat.OpenXML library which has a NuGet package.

How to convert docx to html file using open xml with formatting

I know there are lot of question having same title but I am currently having some issue for them I didn't get the correct way to go.
I am using Open xml sdk 2.5 along with Power tool to convert .docx file to .html file which uses HtmlConverter class for conversion.
I am successfully able to convert the docx file into the Html file but the problem is, html file doesn't retain the original formatting of the document file. eg. Font-size,color,underline,bold etc doesn't reflect into the html file.
Here is my existing code:
public void ConvertDocxToHtml(string fileName)
{
byte[] byteArray = File.ReadAllBytes(fileName);
using (MemoryStream memoryStream = new MemoryStream())
{
memoryStream.Write(byteArray, 0, byteArray.Length);
using (WordprocessingDocument doc = WordprocessingDocument.Open(memoryStream, true))
{
HtmlConverterSettings settings = new HtmlConverterSettings()
{
PageTitle = "My Page Title"
};
XElement html = HtmlConverter.ConvertToHtml(doc, settings);
File.WriteAllText(#"E:\Test.html", html.ToStringNewLineOnAttributes());
}
}
}
So I just want to know if is there any way by which I can retain the formatting in converted HTML file.
I know about some third party APIs which does the same thing. But I would prefer if there any way using open xml or any other open source to do this.
PowerTools for Open XML just released a new HtmlConverter module. It now contains an open source, free implementation of a conversion from DOCX to HTML formatted with CSS. The module HtmlConverter.cs supports all paragraph, character, and table styles, fonts and text formatting, numbered and bulleted lists, images, and more. See https://openxmldeveloper.org/
Your end result will not look exactly the way your Word Document turns out, but this link might help.
You might want to find an external tool to help you do this, like Aspose Words
You can use OpenXML Viewer extension for Firefox for Converting with formatting.
http://openxmlviewer.codeplex.com
This works for me. Hope this helps.

OpenXML SDK WordprocessingDocument display in WebBrowser control

I'm currently trying this:
using DocumentFormat.OpenXml.Packaging;
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(fileNameDocx as string, true))
{
var xdoc = wordDoc.MainDocumentPart;
mainWebBrowser.NavigateToString(xdoc.Document.OuterXml.ToString());
}
But this just gives me the text and none of the formatting. Is it possible to show a ".docx" in a webbrowser control like this?
There are some articles related to DOCX to HTML converition:
Transforming Open XML WordprocessingML to XHTML Using the Open XML SDK 2.0
Transforming Open XML WordprocessingML to XHtml
Batch conversion of docx to clean HTML
Please try these approaches above.
Hope that helps
If you'd like to show XML in web browser control, you might need to:
load this XML to XDocument,
prepare HTML template with block,
and then - write formatted XML from XDocument into PRE block.
You could either read the final HTML into string and use 'mainWebBrowser.NavigateToString' as you mentioned, or write HTML file to drive and read it from there into your mainWebBrowser
Hope that helps

Extract Data from .PDF files

I need to extract data from .PDF files and load it in to SQL 2008.
Can any one tell me how to proceed??
Here is an example of how to use iTextSharp to extract text data from a PDF. You'll have to fiddle with it some to make it do exactly what you want, I think it's a good outline. You can see how the StringBuilder is being used to store the text, but you could easily change that to use SQL.
static void Main(string[] args)
{
PdfReader reader = new PdfReader(#"c:\test.pdf");
StringBuilder builder = new StringBuilder();
for (int x = 1; x <= reader.NumberOfPages; x++)
{
PdfDictionary page = reader.GetPageN(x);
IRenderListener listener = new SBTextRenderer(builder);
PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
PdfDictionary pageDic = reader.GetPageN(x);
PdfDictionary resourcesDic = pageDic.GetAsDict(PdfName.RESOURCES);
processor.ProcessContent(ContentByteUtils.GetContentBytesForPage(reader, x), resourcesDic);
}
}
public class SBTextRenderer : IRenderListener
{
private StringBuilder _builder;
public SBTextRenderer(StringBuilder builder)
{
_builder = builder;
}
#region IRenderListener Members
public void BeginTextBlock()
{
}
public void EndTextBlock()
{
}
public void RenderImage(ImageRenderInfo renderInfo)
{
}
public void RenderText(TextRenderInfo renderInfo)
{
_builder.Append(renderInfo.GetText());
}
#endregion
}
Imagine if you asked this question. How can I load data from arbitrary text files into a SQL table. The challenge isn't opening the text file and reading it, its getting meaningful data out of the files automatically.
So you can use either iText or pdfSharp to read the PDF files, but its the getting meaningful data out that's going to be the challenge.
what you need to do is to use a tool to extract the text from PDF first and then read the file into a binary reader .. then store it into your database .. for extracting the text there are several tools to use. the first to mention are:
iTextsharp which is a Library that can be downloaded and used to do extensive work and in-depth edits and builds when dealing with PDF documents, and there are a lot of examples available online along with a full book that explains the ins and outs of itThe second tool is Adobe PDF iFilter which is a tool from adobe to deal with PDF modifications and manipulation.
Also Foxit iFilter also is a similar assembly that can do just what u r asking for!
PDF Boxwill also serve you!
these are the most well known and well documented ones!
check the following examples:
try the following examples on code project:
Parsing PDF files in .NET using PDFBox and IKVM.NET.A simple class to extract plain text from PDF documents with ITextSharpUsing the IFilter interface to extract text from various document types A parser for PDF Forms written in C#.NET
These do the job and they ain't hard to understand. Hope they help you :-)
A final note: as for me, i would iTextSharp as it's the most well documented library with most available examples.
If you mean metadata, try this question (first answer)
Read/Modify PDF Metadata using iTextSharp
You'll have to do the database stuff yourself though.

Categories