Read a PDF file and convert it to HTML elements - C#

We have a requirement to read a PDF file and convert its content into HTML elements (i.e. text, date field, textarea, etc.). Is there a plugin available, or any other method?

Maybe this information can help you.
NuGet:
PM> Install-Package sautinsoft.pdffocus
Example:
string pathToPdf = @"d:\Tempos\table.pdf";
string pathToHtml = Path.ChangeExtension(pathToPdf, ".htm");
// Convert PDF file to HTML file
SautinSoft.PdfFocus f = new SautinSoft.PdfFocus();
// You may download the latest version of SDK here:
// www.sautinsoft.com/products/pdf-focus/download.php
// Let's force the component to store images inside HTML document
// using base-64 encoding
f.HtmlOptions.IncludeImageInHtml = true;
f.HtmlOptions.Title = "Simple text";
// This property is necessary only for registered version
f.OpenPdf(pathToPdf);
if (f.PageCount > 0)
{
int result = f.ToHtml(pathToHtml);
//Show HTML document in browser
if (result == 0)
{
System.Diagnostics.Process.Start(pathToHtml);
}
}

Related

Extracting and Sorting data from pdf using C# package

I'm working on a project where I have to extract specific text from a PDF so that I can send this information to an Excel file.
At first I tried converting the PDF into a .txt file, thinking that a .txt file would be easier to convert into JSON.
But the result is not at all what I need (dictionary-style JSON); instead I get one giant, messy string.
The pdf sample looks like this:
Analysis
Some text
Reference Date (Big space) 11/17/2021
Reference Price (Big space) USD 745
Client id (Big space) 4572845
I'd like to have something like this at the end:
{Analysis:Some text, Reference Date:11/17/2021, Reference Price:USD 745, Client id:4572845}
Currently the results come out with all the information mixed together.
Here is my code:
First, I created a "Global" class containing the method "Extract_RowInfo_TS", which loads the first page of the document (called a TS, or Termsheet), extracts the text from the PDF and stores it in a text file called "result.txt":
class Global
{
public static void Extract_RowInfo_TS(string doc_Type, string docPath, int? nbrPage = null)
{
switch (doc_Type)
{
case "Pdf":
Spire.Pdf.PdfDocument doc = new Spire.Pdf.PdfDocument();
doc.LoadFromFile(docPath);
StringBuilder buffer = new StringBuilder();
//Extract text from the first page only
Spire.Pdf.PdfPageBase pagefirst = doc.Pages[0];
buffer.Append(pagefirst.ExtractText());
doc.Close();
//save text
String fileName = @"my_disk:\my_path\result.txt";
File.WriteAllText(fileName, buffer.ToString());
//Load File
System.Diagnostics.Process.Start(fileName);
break;
case "Excel":
Spire.Xls.Workbook Wb = new Spire.Xls.Workbook();
break;
case "Word":
Spire.Doc.Document doc_word = new Spire.Doc.Document();
break;
}
}
}
Back in my main form, I call the "Extract_RowInfo_TS" method from the Global class, and once it has created "result.txt" from the PDF contents, I try to convert "result.txt" into JSON:
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void btn_Extract_PDF_Click(object sender, EventArgs e)
{
Global.Extract_RowInfo_TS("Pdf", @"my_disk:\my_path\my_doc.pdf");
Convert_To_Json_Format(@"my_disk:\my_path\result.txt");
}
private void Convert_To_Json_Format(string baseTextFile)
{
// File.ReadAllText opens, reads and closes the file in one call
string streamText = File.ReadAllText(baseTextFile);
//Serialize Json Data.
string serializeData = Serialize_into_Json(streamText);
string newFile = @"my_disk:\my_path\NEW_text_file_2.txt";
File.WriteAllText(newFile, serializeData);
System.Diagnostics.Process.Start(newFile);
}
private static string Serialize_into_Json(string json)
{
string jsonData = JsonConvert.SerializeObject(json);
return jsonData;
}
}
I'm stuck here trying to create a properly formatted JSON file (or anything similar; I just want to group related pieces of information together, maybe by creating a table first, I don't know) that I can use to send the data to my Excel file. Any help would be much appreciated! I'm using the free version of the Spire NuGet package v4.3.1, which contains Free Spire.PDF, Spire.Xls, Spire.Doc and more. But maybe there are other solutions out there to achieve what I'm looking for.
Thanks in advance for helping and have a great day.
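One point worth noting about the code above: JsonConvert.SerializeObject called on a plain string only produces a quoted JSON string, not an object, so the text has to be split into key/value pairs first. A minimal sketch of that step follows. It assumes the extracted text keeps the "big space" as a run of two or more whitespace characters, and that a line without such a gap is a label whose value is on the next line (as with "Analysis" / "Some text"). The class name TermsheetParser and its methods are hypothetical, and System.Text.Json is used here instead of Json.NET only to keep the sketch self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Spire or Json.NET.
public static class TermsheetParser
{
    // Assumption: the "big space" survives extraction as 2+ whitespace characters.
    private static readonly Regex Gap = new Regex(@"\s{2,}");

    public static Dictionary<string, string> Parse(string text)
    {
        var result = new Dictionary<string, string>();
        string pendingKey = "";
        foreach (var raw in text.Split('\n'))
        {
            var line = raw.Trim();
            if (line.Length == 0) continue;

            // Split into at most two parts: label and value.
            var parts = Gap.Split(line, 2);
            if (parts.Length == 2)
            {
                result[parts[0]] = parts[1];
                pendingKey = "";
            }
            else if (pendingKey.Length == 0)
            {
                // No gap: treat the line as a label and wait for its value.
                pendingKey = line;
            }
            else
            {
                // The previous line was a label, so this line is its value.
                result[pendingKey] = line;
                pendingKey = "";
            }
        }
        return result;
    }

    // Serializes the parsed pairs as a JSON object.
    public static string ToJson(string text) => JsonSerializer.Serialize(Parse(text));
}
```

Calling TermsheetParser.ToJson(File.ReadAllText(path)) on "result.txt" would then give a JSON object rather than one long string. If the real extraction output uses tabs or single spaces instead, the Gap pattern would need adjusting.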

WebBrowser.Document.ExecCommand with "Copy" parameter is not working in C# windows application

I am trying to convert HTML text into RTF in a C# Windows application.
For that, I have:
Created a sample Windows application in C#.
Used a WebBrowser control.
Loaded the HTML text into it.
Called the ExecCommand method of the web browser's Document object with the "SelectAll" and "Copy" parameters, one after the other.
The SelectAll command selects the text, but the Copy command does not copy the selected text to the clipboard.
Following is the code that I have used:
//Load HTML text
System.Windows.Forms.WebBrowser webBrowser = new System.Windows.Forms.WebBrowser();
webBrowser.IsWebBrowserContextMenuEnabled = true;
webBrowser.Navigate("about:blank");
webBrowser.Document.Write(htmlText);//htmlText = Valid HTML text
//Copy formatted text from web browser
webBrowser.Document.ExecCommand("SelectAll", false, null);
webBrowser.Document.ExecCommand("Copy", false, null); // NOT WORKING
//Paste copied text from clipboard to Rich Text Box control
using (System.Windows.Forms.RichTextBox objRichTextBox = new System.Windows.Forms.RichTextBox())
{
objRichTextBox.SelectAll();
objRichTextBox.Paste();
string rtfText = objRichTextBox.Rtf;
}
Notes:
I have also marked the Main method with STAThreadAttribute.
This does not work on the client system (Windows Server 2019).
It works fine on my system (Windows 7, 32-bit).
The browser version is the same on my system and the client system, i.e. IE 11.
We don't want to use any paid tool like SautinSoft.
I had the same problem.
This helped:
webBrowser.DocumentText = value;
while (webBrowser.DocumentText != value) Application.DoEvents();
webBrowser.Document.ExecCommand("SelectAll", false, null);
webBrowser.Document.ExecCommand("Copy", false, null);
richTextBoxActually.Text = "";
richTextBoxActually.Paste();
The WebBrowser control probably needs several message-loop iterations to render the text before it can be copied.
I've used this solution for our task:
// The conversion process will be done completely in memory.
string inpFile = @"..\..\..\example.html";
string outFile = @"ResultStream.rtf";
byte[] inpData = File.ReadAllBytes(inpFile);
byte[] outData = null;
using (MemoryStream msInp = new MemoryStream(inpData))
{
// Load a document.
DocumentCore dc = DocumentCore.Load(msInp, new HtmlLoadOptions());
// Save the document to RTF format.
using (MemoryStream outMs = new MemoryStream())
{
dc.Save(outMs, new RtfSaveOptions() );
outData = outMs.ToArray();
}
// Show the result for demonstration purposes.
if (outData != null)
{
File.WriteAllBytes(outFile, outData);
System.Diagnostics.Process.Start(new System.Diagnostics.ProcessStartInfo(outFile) { UseShellExecute = true });
}
}

Batch converting files with C# .NET

In the code snippet below, I am requesting the user to input their directory path to target their .pdf file to be converted. However, I would like to be able to convert a batch of .pdf files at once. How could I go about doing this? Say the user has 100 .pdf files in the directory path each with different file names. What is the best way to alter my code to be able to batch convert all the .pdf files at once?
Console.WriteLine("PDF to Excel conversion requires a user directory path");
Console.WriteLine(@"c:\Users\username\Desktop\FolderName\FileName.pdf");
Console.WriteLine("Your Directory Path: ");
var userPath = Console.ReadLine();
string pathToPdf = userPath;
string pathToExcel = Path.ChangeExtension(pathToPdf, ".xls");
// Converting PDF to Excel file
SautinSoft.PdfFocus f = new SautinSoft.PdfFocus();
// 'true' = convert data to spreadsheet (tabular and textual)
// 'false' = skip textual data and convert only tabular (tables)
f.ExcelOptions.ConvertNonTabularDataToSpreadsheet = true;
// 'true' = preserve the original page layout
// 'false' = place tables before text
f.ExcelOptions.PreservePageLayout = true;
f.OpenPdf(pathToPdf);
if (f.PageCount > 0)
{
int result = f.ToExcel(pathToExcel);
// open an excel workbook
if (result == 0)
{
System.Diagnostics.Process.Start(pathToExcel);
}
}
Edit: Below is my attempt to write the program using the Directory method from Bradley's answer.
static void Main(string[] args)
{
Console.WriteLine("Welcome. I am Textron's PDF to Excel converter.");
Console.WriteLine("\n - Create a folder with all your .pdf files to be converted");
Console.WriteLine("\n - You must define your directory path");
Console.WriteLine(@" For Example ==> c:\Users\Username\Desktop\YourFolder");
Console.WriteLine("\n Your directory: ");
var userPath = Console.ReadLine();
foreach (string file in Directory.EnumerateFiles(userPath, "*.pdf"))
{
string excelPath = Path.ChangeExtension(userPath, ".xls");
// Converting PDF to Excel filetype
SautinSoft.PdfFocus f = new SautinSoft.PdfFocus();
// 'true' = convert data to spreadsheet (tabular and textual)
// 'false' = skip textual data and convert only tabular (tables)
f.ExcelOptions.ConvertNonTabularDataToSpreadsheet = true;
f.OpenPdf(userPath);
if (f.PageCount > 0)
{
int result = f.ToExcel(excelPath);
// open an excel workbook
if (result == 0)
{
System.Diagnostics.Process.Start(excelPath);
}
}
}
}
To get all files in a directory, use Directory.EnumerateFiles (MSDN). In your case:
foreach (string file in Directory.EnumerateFiles(directoryPath, "*.pdf"))
{
// PDF code, probably extracted to its own method!
}
In this specific case GetFiles would also work, but EnumerateFiles is better if you only want to process a subset, because it evaluates lazily.
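As a rough sketch of "extracted to its own method", the loop could look like the following. The class BatchConverter and the convertOne callback are hypothetical; the callback stands in for the PdfFocus code (OpenPdf, ToExcel) and is assumed to return true on success. Note that inside the loop the per-file path (here pdfPath) is what gets passed to Path.ChangeExtension and to the converter, not the directory path.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper; convertOne stands in for the actual PDF-to-Excel call.
public static class BatchConverter
{
    // Converts every *.pdf in directoryPath and returns the output paths
    // of the files that were converted successfully.
    public static List<string> ConvertAll(string directoryPath,
                                          Func<string, string, bool> convertOne)
    {
        var converted = new List<string>();
        foreach (string pdfPath in Directory.EnumerateFiles(directoryPath, "*.pdf"))
        {
            // Derive the output name from the individual file, not the folder.
            string excelPath = Path.ChangeExtension(pdfPath, ".xls");
            if (convertOne(pdfPath, excelPath))
            {
                converted.Add(excelPath);
            }
        }
        return converted;
    }
}
```

With 100 files it is probably better not to open each workbook after conversion; printing the returned list once at the end may be enough.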

vsto + differentiate attachments

I need to get and save the attachment(s) from a mail item, but the code below returns all attachments, including embedded images such as the logo in the sender's signature. How can I differentiate a true attachment from an embedded image? I have read a lot on forums but it is still unclear to me.
public static void SaveData(MailItem currentMailItem)
{
if (currentMailItem != null)
{
if (currentMailItem.Attachments.Count > 0)
{
for (int i = 1; i <= currentMailItem.Attachments.Count; i++)
{
currentMailItem.Attachments[i].SaveAsFile(@"C:\TestFileSave\" + currentMailItem.Attachments[i].FileName);
}
}
}
}
You can check whether an attachment is inline or not by using the following pseudo-code from MS Technet Forums.
if body format is plain text then
    no attachment is inline
else if body format is RTF then
    if PR_ATTACH_METHOD value is 6 (ATTACH_OLE) then
        attachment is inline
    else
        attachment is normal
else if body format is HTML then
    if PR_ATTACH_FLAGS value has the 4 bit set (ATT_MHTML_REF) then
        attachment is inline
    else
        attachment is normal
You can access the message body format using MailItem.BodyFormat and the MAPI attachment properties using Attachment.PropertyAccessor.
string PR_ATTACH_METHOD = "http://schemas.microsoft.com/mapi/proptag/0x37050003";
var attachMethod = attachment.PropertyAccessor.GetProperty(PR_ATTACH_METHOD);
string PR_ATTACH_FLAGS = "http://schemas.microsoft.com/mapi/proptag/0x37140003";
var attachFlags = attachment.PropertyAccessor.GetProperty(PR_ATTACH_FLAGS);
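The pseudo-code above can be sketched as a small pure function. To keep it compilable without the Outlook interop assembly, the BodyFormat enum below is a hypothetical stand-in for OlBodyFormat, and the class name AttachmentClassifier is likewise made up; in an add-in you would map MailItem.BodyFormat and the two property values read through PropertyAccessor onto these parameters.

```csharp
using System;

// Hypothetical stand-in for Outlook's OlBodyFormat enumeration.
public enum BodyFormat { Plain, Html, RichText }

// Hypothetical helper implementing the inline/normal decision.
public static class AttachmentClassifier
{
    private const int AttachOle = 6;    // PR_ATTACH_METHOD value ATTACH_OLE
    private const int AttMhtmlRef = 4;  // PR_ATTACH_FLAGS bit ATT_MHTML_REF

    public static bool IsInline(BodyFormat format, int attachMethod, int attachFlags)
    {
        switch (format)
        {
            case BodyFormat.Plain:
                // Plain-text bodies cannot embed attachments.
                return false;
            case BodyFormat.RichText:
                return attachMethod == AttachOle;
            case BodyFormat.Html:
                // Bitwise test: other flag bits may be set as well.
                return (attachFlags & AttMhtmlRef) != 0;
            default:
                return false;
        }
    }
}
```

In the SaveData loop, an attachment would then be saved only when IsInline returns false.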

Adding a Header and a footer in C# with the output from my program

I am a newbie to C#. My program has a "Save to File" option that saves the contents of a RichTextBox to a Word document; when the user chooses this option, a SaveFileDialog lets them choose the file name and location.
What I want is that every time the user chooses this option, the Word document in which the output is saved has predefined header and footer images...
Thanks a lot for your help in advance!
Below is my "Save to File" code.
private void menuItem7_Click(object sender, EventArgs e)
{
// Create a SaveFileDialog to request a path and file name to save to.
SaveFileDialog saveFile1 = new SaveFileDialog();
// Initialize the SaveFileDialog to specify the RTF extension for the file.
saveFile1.DefaultExt = "*.rtf";
saveFile1.Filter = "RTF Files|*.rtf";
// Determine if the user selected a file name from the saveFileDialog.
if (saveFile1.ShowDialog() == System.Windows.Forms.DialogResult.OK &&
saveFile1.FileName.Length > 0)
{
// Save the contents of the RichTextBox into the file.
richTextBox1.SaveFile(saveFile1.FileName,
RichTextBoxStreamType.PlainText);
}
}
First of all, create a function that takes an image path, a width and a height and returns the RTF for the picture.
This one is for a PNG:
public string GetImage(string path, int width, int height)
{
// Dispose the stream and the image once the RTF string has been built
using (var stream = new MemoryStream())
using (var img = Image.FromFile(path))
{
img.Save(stream, System.Drawing.Imaging.ImageFormat.Png);
var bytes = stream.ToArray();
var str = BitConverter.ToString(bytes, 0).Replace("-", string.Empty);
// \picw and \pich give the source size; \picwgoal and \pichgoal give the desired size in twips
var mpic = @"{\pict\pngblip\picw" + img.Width.ToString() + @"\pich" + img.Height.ToString() +
@"\picwgoal" + width.ToString() + @"\pichgoal" + height.ToString() +
@"\hex " + str + "}";
return mpic;
}
}
Now you need to insert this 'image' into the right place in the RTF. If you open your RTF file in Notepad you should see something like this:
{\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset0
Microsoft Sans Serif;}} \viewkind4\uc1\pard\f0\fs17 MYTEXT\par }
If you want a quick and dirty method, get the RTF from the RichTextBox into a string,
and insert your header image string after the deflang2057, followed by a '\par' to make a new line. Then insert your footer image string just before the closing '}'.
Something like this:
// Determine if the user selected a file name from the saveFileDialog.
if (saveFile1.ShowDialog() == System.Windows.Forms.DialogResult.OK &&
saveFile1.FileName.Length > 0)
{
var rtf = richTextBox1.Rtf.Insert(richTextBox1.Rtf.IndexOf("deflang2057") + 11, GetImage(@"c:\a.png", 5, 5) + @"\par");
using (var rtfFile = new StreamWriter(saveFile1.FileName))
{
rtfFile.Write(rtf);
}
}
I hope that gets you started.
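For experimenting with the picture group without System.Drawing, here is a minimal sketch that builds the same kind of \pict string from raw PNG bytes. The class RtfPicture and its parameter names are hypothetical, and the caller is assumed to know the pixel size already. The goal sizes are in twips (1/1440 inch). The sketch leaves out the \hex token, on the assumption that hexadecimal is already the default encoding for \pict data.

```csharp
using System;

// Hypothetical helper; builds an RTF picture group from PNG bytes.
public static class RtfPicture
{
    public static string Build(byte[] pngBytes, int pixelWidth, int pixelHeight,
                               int goalWidthTwips, int goalHeightTwips)
    {
        // BitConverter yields "89-50-4E-..."; RTF wants the digits without dashes.
        string hex = BitConverter.ToString(pngBytes).Replace("-", string.Empty);

        return @"{\pict\pngblip\picw" + pixelWidth + @"\pich" + pixelHeight +
               @"\picwgoal" + goalWidthTwips + @"\pichgoal" + goalHeightTwips +
               " " + hex + "}";
    }
}
```

For example, RtfPicture.Build(File.ReadAllBytes(path), 100, 50, 1500, 750) would describe a 100x50 pixel image displayed at roughly 1.04 by 0.52 inches.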
Here is an example of how to use the Open XML SDK ...
Source Code
You will have to plug in your text where it says "Original Text Here".
You have picked a very difficult task for a "newbie". To mix images and text you will need a complex format like PostScript, PDF, DocX, or RTF. Controlling pagination, for example to specify your header and footer images only once and have them automatically show up on the top and bottom of each page, is an even more difficult task.
You have not given us enough information to tell you where to start. For example, what is "my program"? Is it like a word processor? You will have to use the System.Drawing.Printing.PrintDocument classes to define a print document, draw headers and footers when you reach the appropriate place on each page, perform line layout, breaks and pagination. This is a large job for a professional programmer.
Or do you just want to produce a file that another program can output, with headers and footers? You could output RTF; the specification is here. This is an easier task; you may be able to leverage existing RTF interpreters.
Or do you want to display these documents on-screen? Han's suggestion to use an existing application via automation is a good one.
Break your task into smaller requirements and investigate each requirement.
