Extracting and sorting data from a PDF using a C# package

I'm working on a project where I have to extract specific text from a PDF so that I can send this information to an Excel file.
At first I tried converting the PDF into a .txt file, thinking that plain text would be easier to convert into JSON.
But the result is not what I need (a dictionary-style JSON format); instead I get one giant, messy string.
The PDF sample looks like this:
Analysis
Some text
Reference Date (Big space) 11/17/2021
Reference Price (Big space) USD 745
Client id (Big space) 4572845
I'd like to end up with something like this:
{Analysis:Some text, Reference Date:11/17/2021, Reference Price:USD 745, Client id:4572845}
Currently, all the information comes out mixed together.
Here is my code:
First, I created a "Global" class containing the method "Extract_RowInfo_TS", which loads the first page of the document (called a TS, or termsheet), extracts the text from the PDF, and stores it in a text file called "result.txt":
using System.IO;
using System.Text;

class Global
{
    public static void Extract_RowInfo_TS(string doc_Type, string docPath, int? nbrPage = null)
    {
        switch (doc_Type)
        {
            case "Pdf":
                Spire.Pdf.PdfDocument doc = new Spire.Pdf.PdfDocument();
                doc.LoadFromFile(docPath);
                StringBuilder buffer = new StringBuilder();
                // Extract text from the first page only
                Spire.Pdf.PdfPageBase pagefirst = doc.Pages[0];
                buffer.Append(pagefirst.ExtractText());
                doc.Close();
                // Save the extracted text
                string fileName = @"my_disk:\my_path\result.txt";
                File.WriteAllText(fileName, buffer.ToString());
                // Open the file in the default viewer
                System.Diagnostics.Process.Start(fileName);
                break;
            case "Excel":
                Spire.Xls.Workbook Wb = new Spire.Xls.Workbook();
                break;
            case "Word":
                Spire.Doc.Document doc_word = new Spire.Doc.Document();
                break;
        }
    }
}
Back in my main form, I call "Extract_RowInfo_TS" from the Global class above, and once it has created "result.txt" from the PDF, I try to convert "result.txt" into JSON:
using System;
using System.IO;
using System.Windows.Forms;
using Newtonsoft.Json;

public partial class Form1 : Form
{
    public Form1()
    {
        InitializeComponent();
    }

    private void btn_Extract_PDF_Click(object sender, EventArgs e)
    {
        Global.Extract_RowInfo_TS("Pdf", @"my_disk:\my_path\my_doc.pdf");
        Convert_To_Json_Format(@"my_disk:\my_path\result.txt");
    }

    private void Convert_To_Json_Format(string baseTextFile)
    {
        // Read the whole file; File.ReadAllText also closes the file afterwards
        string streamText = File.ReadAllText(baseTextFile);
        // Serialize JSON data
        string serializeData = Serialize_into_Json(streamText);
        string newFile = @"my_disk:\my_path\NEW_text_file_2.txt";
        File.WriteAllText(newFile, serializeData);
        System.Diagnostics.Process.Start(newFile);
    }

    private static string Serialize_into_Json(string json)
    {
        // Note: this serializes the whole text as a single JSON string,
        // not as separate key/value pairs
        string jsonData = JsonConvert.SerializeObject(json);
        return jsonData;
    }
}
I'm stuck trying to create a proper JSON file (or anything similar; I just want to group related information together, maybe by building a table first?) that I can use to populate my Excel file. Any help would be much appreciated! I'm using the free version of the Spire NuGet package v4.3.1, which contains Free Spire.PDF, Spire.Xls, Spire.Doc and more. But maybe there are other solutions out there that achieve what I'm looking for.
Thanks in advance for your help, and have a great day.
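One possible direction, as a rough sketch only: instead of serializing the whole extracted string, split it into lines and turn each line into a label/value pair before serializing. The class name TermsheetParser and the splitting rule are hypothetical. The sketch assumes the "big space" between a label and its value survives extraction as a run of two or more whitespace characters, and that a line without such a gap (like "Analysis") is a label whose value sits on the next line. What ExtractText() actually returns for a given PDF may differ, so check "result.txt" first.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TermsheetParser
{
    // Assumption: the "big space" between label and value is extracted
    // as a run of two or more whitespace characters.
    private static readonly Regex Gap = new Regex(@"\s{2,}");

    public static Dictionary<string, string> Parse(string text)
    {
        var fields = new Dictionary<string, string>();
        string pendingLabel = "";

        foreach (string raw in text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries))
        {
            string line = raw.Trim();
            if (line.Length == 0) continue;

            // Split into at most two parts: label and value.
            string[] parts = Gap.Split(line, 2);
            if (parts.Length == 2)
            {
                fields[parts[0]] = parts[1];
                pendingLabel = "";
            }
            else if (pendingLabel.Length == 0)
            {
                // A line without a gap is remembered as a label ("Analysis")...
                pendingLabel = line;
            }
            else
            {
                // ...and the next gap-less line becomes its value ("Some text").
                fields[pendingLabel] = line;
                pendingLabel = "";
            }
        }
        return fields;
    }
}
```

Passing the resulting dictionary (rather than the raw string) to JsonConvert.SerializeObject should then produce a JSON object with one property per label, which is closer to the dictionary-style output described above.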

Creating files using C#, like Evernote

I am currently making a UI for a note keeper and was just going to preview documents, but I was wondering what file type I would need to create if I instead wanted to do things like tag the file, preferably in C#. Basically, I want to make my own Evernote. How do these programs store the notes?
I don't know how to tag the file directly, but you could create your own system to do it. Here are two ways:
The first way is to format the note file's contents so that there are two parts: the tags and the actual text. When the program loads the file, it separates the tags from the text. The downside is that the program has to load the whole file just to find the tags.
The second way is to have a database that maps each file name to its associated tags. That way the program doesn't have to load the whole file just to find the tags.
The first way
In this solution you need to format your files in a specific way:
<Tags>
tag1,tag2,tag3
</Tags>
<Text>
The text you
want in here
</Text>
With the file set up like this, the program can separate the tags from the text. To load its tags you'd need this code:
// Requires: using System; using System.Collections.Generic; using System.IO; using System.Linq;
public List<string> GetTags(string filePath)
{
    string fileContents;
    // Read the file if it exists
    if (File.Exists(filePath))
        fileContents = File.ReadAllText(filePath);
    else
        return null;
    // Find the place where "</Tags>" is located
    int tagEnd = fileContents.IndexOf("</Tags>");
    // No closing tag means the file is not in the expected format
    if (tagEnd < 0)
        return null;
    // Get the tags; 6 comes from the length of "<Tags>"
    string tagString = fileContents.Substring(6, tagEnd - 6).Replace(Environment.NewLine, "");
    return tagString.Split(',').ToList();
}
Then to get the text you'd need this:
public string GetText(string filePath)
{
    string fileContents;
    // Read the file if it exists
    if (File.Exists(filePath))
        fileContents = File.ReadAllText(filePath);
    else
        return null;
    // Find the place where the text content begins.
    // The length of the newline is necessary because the line break
    // after "<Text>" must NOT be included in the text content.
    int textStart = fileContents.IndexOf("<Text>") + 6 + Environment.NewLine.Length;
    // Find the place where the text content ends
    int textEnd = fileContents.LastIndexOf("</Text>");
    // Subtract the newline length again to exclude the line break before "</Text>"
    return fileContents.Substring(textStart, textEnd - textStart - Environment.NewLine.Length);
}
Then I'll let you find out how you do the rest.
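For the write side, a minimal sketch that produces a file in the layout the two readers above expect. The class name NoteFile and its method names are hypothetical, and the sketch assumes tags never contain commas or line breaks.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class NoteFile
{
    // Builds the file contents: a <Tags> block followed by a <Text> block,
    // each delimiter on its own line, matching what GetTags/GetText parse.
    public static string Format(IEnumerable<string> tags, string text)
    {
        string nl = Environment.NewLine;
        return "<Tags>" + nl + string.Join(",", tags) + nl + "</Tags>" + nl
             + "<Text>" + nl + text + nl + "</Text>";
    }

    // Writes the formatted note to disk, replacing any existing file.
    public static void Save(string filePath, IEnumerable<string> tags, string text)
    {
        File.WriteAllText(filePath, Format(tags, text));
    }
}
```

Because the reader uses LastIndexOf("</Text>"), note text that itself contains "</Text>" should still round-trip, but tags containing "</Tags>" would not.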
The second way
In this solution you have a database file listing all your files and their associated tags. The database file would look like this:
[filename]:[tags]
file.txt:tag1, tag2, tag3
file2.txt:tag4, tag5, tag6
The program will then read the file name and the tags in this way:
public static void LoadDatabase(string databasePath)
{
    string[] fileContents;
    // End process if the database doesn't exist
    if (!File.Exists(databasePath))
        return;
    // Read all lines separately and put them into an array
    fileContents = File.ReadAllLines(databasePath);
    foreach (string str in fileContents)
    {
        string fileName = str.Split(':')[0]; // Get the filename
        string tags = str.Split(':')[1]; // Get the tags
        // Do what you must with the information
    }
}
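As a sketch of what "do what you must with the information" could look like, the lines can be collected into a lookup from file name to tag list. TagDatabase is a hypothetical name; the sketch assumes the file holds only data lines of the form name:tag1, tag2 and that file names are bare names without a colon (so no full Windows paths).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TagDatabase
{
    public static Dictionary<string, List<string>> Parse(IEnumerable<string> lines)
    {
        var index = new Dictionary<string, List<string>>();
        foreach (string line in lines)
        {
            // Split on the first ':' only, so tags may themselves contain colons.
            int colon = line.IndexOf(':');
            if (colon <= 0) continue; // skip blank or malformed lines

            string fileName = line.Substring(0, colon).Trim();
            List<string> tags = line.Substring(colon + 1)
                .Split(',')
                .Select(t => t.Trim())          // "tag1, tag2" has spaces after commas
                .Where(t => t.Length > 0)
                .ToList();
            index[fileName] = tags;
        }
        return index;
    }
}
```

It could be called as TagDatabase.Parse(File.ReadAllLines(databasePath)) once the existence check has passed.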
I hope this helps.

Can I convert multiple Excel workbooks into one PDF (without using iTextSharp)?

I want to convert multiple Excel workbooks (not sheets) into one PDF file. I don't want to use iTextSharp because I would need to purchase a license for commercial use.
Does anybody have any idea?
Well, this is a little complex. Maybe you can convert the Excel documents to PDF first and then merge them into a single PDF document. What do you think? How is your plan going?
You can refer to the following article; the main idea is to convert the Office files to PDF and then merge them.
http://www.dotnetspider.com/resources/46252-Convert-and-Merge-Office-Files-to-One-PDF-File-in-C.aspx
To get better help, it would be useful to show more information, as diNN's comment above suggests.
Here is what I used:
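using System.IO;
using GemBox.Spreadsheet;

public static class ExcelMergeExtension
{
    // Copies every worksheet of the source workbook into the destination,
    // prefixing each sheet name with the source file name to avoid clashes
    public static ExcelFile Merge(this ExcelFile destination, string sourcePath)
    {
        var sourceFileName = Path.GetFileNameWithoutExtension(sourcePath);
        var source = ExcelFile.Load(sourcePath);
        foreach (var sourceSheet in source.Worksheets)
            destination.Worksheets.AddCopy(
                string.Format("{0}-{1}", sourceFileName, sourceSheet.Name),
                sourceSheet);
        return destination;
    }
}

class Program
{
    static void Main(string[] args)
    {
        // GemBox.Spreadsheet needs a license key set before use;
        // "FREE-LIMITED-KEY" selects the free, size-limited mode
        SpreadsheetInfo.SetLicense("FREE-LIMITED-KEY");

        // Save all worksheets, not just the active one
        var options = new PdfSaveOptions() { SelectionType = SelectionType.EntireFile };
        ExcelFile.Load("Book1.xlsx")
            .Merge("Book2.xlsx")
            .Merge("Book3.xlsx")
            .Save("Books.pdf", options);
    }
}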
The code uses the GemBox.Spreadsheet library, which has free and commercial versions; note, however, that the free one has some size limitations.
Anyway, it worked great for me and I hope it helps you too.

Storing Data From WinForms App in .Txt file

I have a very basic C# WinForms application to generate random numbers. The code is shown below:
private static double RandomNumber(double min, double max)
{
    Random random = new Random();
    var next = random.NextDouble();
    return min + (next * (max - min));
}

private void btnGenerate_Click(object sender, EventArgs e)
{
    var maxNum = Convert.ToDouble(txbInput.Text);
    var randomDec = Math.Round(RandomNumber(0, maxNum), 2);
    txbResult.Text = randomDec.ToString();
}
Now what I want to be able to do is, on the button click, save the generated random number to a local file along with a timestamp.
I am fairly new to C# and have limited knowledge of how to do this, so any suggestions would be highly appreciated.
These examples show various ways to write text to a file. The first two examples use static methods on the System.IO.File class to write either a complete array of strings or a complete string to a text file. Example #3 shows how to add text to a file when you have to process each line individually before writing to the file. Examples 1-3 all overwrite all existing content in the file. Example #4 shows how to append text to an existing file.
class WriteTextFile
{
    static void Main()
    {
        // These examples assume a "C:\Users\Public\TestFolder" folder on your machine.
        // You can modify the path if necessary.

        // Example #1: Write an array of strings to a file.
        // Create a string array that consists of three lines.
        string[] lines = { "First line", "Second line", "Third line" };
        // WriteAllLines creates a file, writes a collection of strings to the file,
        // and then closes the file.
        System.IO.File.WriteAllLines(@"C:\Users\Public\TestFolder\WriteLines.txt", lines);

        // Example #2: Write one string to a text file.
        string text = "A class is the most powerful data type in C#. Like a structure, " +
                      "a class defines the data and behavior of the data type. ";
        // WriteAllText creates a file, writes the specified string to the file,
        // and then closes the file.
        System.IO.File.WriteAllText(@"C:\Users\Public\TestFolder\WriteText.txt", text);

        // Example #3: Write only some strings in an array to a file.
        // The using statement automatically closes the stream and calls
        // IDisposable.Dispose on the stream object.
        using (System.IO.StreamWriter file = new System.IO.StreamWriter(@"C:\Users\Public\TestFolder\WriteLines2.txt"))
        {
            foreach (string line in lines)
            {
                // If the line doesn't contain the word 'Second', write the line to the file.
                if (!line.Contains("Second"))
                {
                    file.WriteLine(line);
                }
            }
        }

        // Example #4: Append new text to an existing file.
        // The using statement automatically closes the stream and calls
        // IDisposable.Dispose on the stream object.
        using (System.IO.StreamWriter file = new System.IO.StreamWriter(@"C:\Users\Public\TestFolder\WriteLines2.txt", true))
        {
            file.WriteLine("Fourth line");
        }
    }
}
// Output (to WriteLines.txt):
//   First line
//   Second line
//   Third line
// Output (to WriteText.txt):
//   A class is the most powerful data type in C#. Like a structure, a class defines the data and behavior of the data type.
// Output to WriteLines2.txt after Example #3:
//   First line
//   Third line
// Output to WriteLines2.txt after Example #4:
//   First line
//   Third line
//   Fourth line
Reference from here
Add this:
// using System.IO;
string filepath = @"C:\test.txt"; // sample file name and location
// Pass true as the second argument to append instead of overwriting the file
using (StreamWriter writer = new StreamWriter(filepath, true))
{
    writer.WriteLine(DateTime.Now.ToString() + " " + randomDec.ToString());
}
To save your text to a file you need to use the System.IO namespace:
System.IO.File.AppendAllText(@"C:\Test.txt", txbResult.Text + " " + DateTime.Now.ToString() + Environment.NewLine);
This shows you how to write a string value to a file.
EDIT: Added the timestamp value.
From the wise words of MSDN:
// Example #2: Write one string to a text file.
string text = "A class is the most powerful data type in C#. Like a structure, " +
"a class defines the data and behavior of the data type. ";
// WriteAllText creates a file, writes the specified string to the file,
// and then closes the file.
System.IO.File.WriteAllText(@"C:\Users\Public\TestFolder\WriteText.txt", text);
Please refer to the documentation for more details and examples.
Edit:
Mine's missing the time stamp, but there are plenty of worthy answers here that add it :)
private void WriteData(double value)
{
    // The second argument (true) appends to the file instead of overwriting it
    using (var file = new System.IO.StreamWriter(@"C:\file.txt", true))
    {
        file.WriteLine(string.Format("{0} {1}", value, DateTime.Now));
    }
}
See the MSDN documentation. To get the current time, use DateTime.Now.
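Pulling the answers above together, here is a small sketch that appends one line per click. NumberLog and its method names are hypothetical. The one design choice it adds is formatting with the invariant culture, so the timestamp and decimal separator look the same regardless of the machine's regional settings; that is an assumption about what is wanted, not something the question asked for.

```csharp
using System;
using System.Globalization;
using System.IO;

public static class NumberLog
{
    // Builds one log line: ISO 8601 timestamp, a tab, then the value with two decimals.
    public static string FormatEntry(double value, DateTime timestamp)
    {
        return timestamp.ToString("s", CultureInfo.InvariantCulture)
             + "\t"
             + value.ToString("F2", CultureInfo.InvariantCulture);
    }

    // Appends the entry to the file, creating the file if it does not exist yet.
    public static void Append(string filePath, double value)
    {
        File.AppendAllText(filePath, FormatEntry(value, DateTime.Now) + Environment.NewLine);
    }
}
```

In the button handler this would be called right after txbResult.Text is set, for example NumberLog.Append(filepath, randomDec) with a path of your choosing.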

System.Convert.ToBase64String returning different result than the input to System.Convert.FromBase64String

I'm writing a web application which saves images from the web to a database. Currently, what I use to save images is as follows:
[WebMethod]
public static void Save(string pictures)
{
    // pictures is an object containing the URL of the image along with some metadata
    List<ImageObject> imageList = new JavaScriptSerializer().Deserialize<List<ImageObject>>(pictures);
    for (int i = 0; i < imageList.Count; i++)
    {
        var webClient = new WebClient();
        byte[] byteArray = webClient.DownloadData(imageList[i].URL);
        imageList[i].Picture = byteArray;
    }
    // some SQL to save imageList
}
ImageObject:
[DataContract]
public class ImageObject
{
    [DataMember]
    public string URL { get; set; }

    // Holds the raw image bytes assigned from WebClient.DownloadData in Save
    [DataMember]
    public byte[] Picture { get; set; }
}
Now, this works just fine for downloading from a URL. The entry saved to my database:
0xFFD8FFE000104A464946000101[...]331B7CCA7F0AD3A02DC8898F
To display this image, I simply call System.Convert.ToBase64String() after retrieving it and use it as an image src. However, I am now trying to add some functionality for users to upload their own pictures. My function for this, called from an <input />:
function uploadPicture() {
    var files = document.getElementById("uploadedPictures").files;
    var numPics = files.length;
    var oFReader = new FileReader();
    oFReader.onload = function (oFREvent) {
        var src = oFREvent.target.result;
        document.getElementById('image').src = src;
    };
    // Read the first selected file as a data URL
    oFReader.readAsDataURL(files[0]);
}
When using this, this is the image source I get from the browser (over 2 MB when the original is only ~500 KB):
data:image/bmp;base64,/9j/4AAQSkZJRgABAgEAY[...]k8tTMZEBCsEbSeRYem493IHAfiHWq6vKlOv/Z
This gets displayed properly when used as a source. However, since this is not an actual URL (and WebClient.DownloadData() is restricted to URLs of < 260 characters), I have been trying other methods to save this data into my database. However, I cannot seem to find a function to save it in the same format as WebClient.DownloadData(). For example, I have looked at converting a base 64 string to an image and saving it, where most of the answers seem to use Convert.FromBase64String(), but using this appears to save in a different format to my database:
0x64006100740061003A0069006D0[...]10051004500420041005100450042004
Trying to use System.Convert.ToBase64String() on this returns
ZABhAHQAYQA6AGkAbQBhAGcAZQAvAGIAbQBwADsAYgBh[...]UwBlAFIAWQBlAG0ANAA5ADMASQBIAEEAZgBpAEgAVwBxADYAdgBLAGwATwB2AC8AWgA=
which is different from what I used Convert.FromBase64String() on. Other things I found on Google to try and get it to save in the same format or display as an image have not worked thus far.
Hence, I am wondering whether there exists a method to convert the result from a FileReader to the same format as WebClient.DownloadData() does for URLs, or if there is some way to convert the 0x6400[...] data to a format that can be displayed by using it as an <img> source.
It turns out that the reason the data saved into the database began with 0x64006100[....] instead of 0xFFD8FFE0[...] was due to me forgetting to strip the URI of the initial data:image;base64, prefix. After doing this, everything saves and is read properly using System.Convert.ToBase64String().
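The fix described above might look roughly like the following. DataUri is a hypothetical helper name; the sketch assumes the string is a base64 data URI, in which everything up to the first comma is the header (for example data:image/bmp;base64) and everything after it is the payload.

```csharp
using System;

public static class DataUri
{
    // Strips the "data:...;base64," header, if present, and decodes the payload
    // into the raw bytes (the same kind of byte[] that DownloadData returns).
    public static byte[] ToBytes(string dataUri)
    {
        int comma = dataUri.IndexOf(',');
        string payload = comma >= 0 ? dataUri.Substring(comma + 1) : dataUri;
        return Convert.FromBase64String(payload);
    }
}
```

Convert.FromBase64String throws a FormatException if the payload is not valid base64, so input coming from the browser is worth wrapping in a try/catch.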

Display Word Document from Resource file to RichTextBox Control

I have a Word document imported into the resource file of my project.
Is it possible to extract this document and display it in the RichTextBox control in my application?
I was able to extract the string and image objects from the resource file of my project using the below class.
using System.Drawing;
using System.Reflection;
using System.Resources;

namespace TestProject
{
    public class Utilities
    {
        private static ResourceManager _resource = new ResourceManager("TestProject.Resource1", Assembly.GetExecutingAssembly());

        // Returns the string resource with the given name
        public static string GetString(string name)
        {
            return (System.String)(_resource.GetString(name));
        }

        // Returns the image resource with the given name
        public static Image GetImage(string name)
        {
            return (System.Drawing.Image)(_resource.GetObject(name));
        }
    }
}
RTF is formatted as a string and if you add it to the Files section of the resources file, it will wrap it with a property to read the string.
That is:
Properties.Resources.YourDocument;
is implemented as:
internal static string YourDocument {
    get {
        return ResourceManager.GetString("YourDocument", resourceCulture);
    }
}
and returns rich text looking something like this:
{\rtf1\ansi\ansicpg1252\deff0\deflang3081{\fonttbl{\f0\fnil\fcharset0
Calibri;}} {\colortbl ;\red255\green255\blue0;} {*\generator Msftedit
5.41.21.2510;}\viewkind4\uc1\pard\sa200\sl276\slmult1\cf1\lang9\f0\fs22
Rich\cf0 , multiline text.\par \par Is \b\fs32 here\b0\fs22\par }
That leaves you just needing to do:
richTextBox1.Rtf = RichTextResource.Properties.Resources.YourDocument;
That assumes the document is actually saved as rich text. A Word document will show up as garbage.
Finally, if your resource is stored as a byte[], you'll need to convert it to a string first, i.e. (assuming it is UTF-8 encoded):
richTextBox1.Rtf = System.Text.Encoding.UTF8.GetString(bytes);
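Since a Word file stored in the resources would not be valid RTF, it may be worth checking the decoded text before assigning it. A minimal sketch, with RtfResource as a hypothetical helper name, assuming the resource bytes are UTF-8 (or plain ASCII) and that genuine RTF starts with the {\rtf header seen in the sample above:

```csharp
using System;
using System.Text;

public static class RtfResource
{
    // Decodes the bytes as UTF-8 and reports whether the result looks like RTF,
    // i.e. starts with "{\rtf" after an optional byte-order mark or whitespace.
    public static bool TryDecode(byte[] bytes, out string rtf)
    {
        rtf = Encoding.UTF8.GetString(bytes);
        string trimmed = rtf.TrimStart('\uFEFF', ' ', '\t', '\r', '\n');
        return trimmed.StartsWith(@"{\rtf", StringComparison.Ordinal);
    }
}
```

When TryDecode returns true the string can be assigned to richTextBox1.Rtf; when it returns false, the resource is probably a binary Word file and would need to be saved as rich text first.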
