MS Word - Insert text in a Word document after specified text (C#)

I found several matches for my request, but none for Word, so I am opening a new thread.
I have a Word document. After a specified text, I want to insert some lines.
My code looks like this:
public static class Program
{
static void Main(string[] args)
{
string Filename = @"D:\...\MasterDoc.docx";
string SearchFor = "Search this text and insert after it";
string DocText = string.Empty;
int InsertIndex = 0;
//Run or attach MS Word
Microsoft.Office.Interop.Word.Application wrdApp = RunOrAttachWordApplication();
//Open masterfile
Microsoft.Office.Interop.Word.Document doc = wrdApp.Documents.Open(Filename);
//Get complete range
Microsoft.Office.Interop.Word.Range rng = doc.Range();
//Get document text
DocText = rng.Text;
//Search index to insert text
InsertIndex = DocText.IndexOf(SearchFor) + SearchFor.Length;
//Define range at location for text pasting
rng = doc.Range(InsertIndex, InsertIndex + 1);
//Write 'Test...' on specified location
rng.InsertAfter("Test...");
//Close document
doc.Close(); /*Right here I got a breakpoint and watch the result, so I do not need to save the document here*/
//Close application
wrdApp.Application.Quit();
wrdApp.Quit();
wrdApp = null;
System.GC.Collect();
System.Console.WriteLine("Ende...");
System.Console.ReadKey();
}
public static Microsoft.Office.Interop.Word.Application RunOrAttachWordApplication()
{
if(System.Diagnostics.Process.GetProcessesByName("Word").Length > 0)
{
return System.Runtime.InteropServices.Marshal.GetActiveObject("Word.Application") as Microsoft.Office.Interop.Word.Application;
}
else
{
return new Microsoft.Office.Interop.Word.Application();
}
}
}
Well, it works, but not correctly. The text is inserted about 50 characters before the location I want.
Does anybody know how to fix that? I could also imagine that there is a much better method to do this. I cannot modify the master document, as it is recreated every time by another, external program.
Thank you very much and regards,
Jan

I find the Find.Execute method much cleaner and shorter for adding/replacing text in the Word VSTO.
doc.Range().Find.Execute(FindText: SearchFor, Replace: WdReplace.wdReplaceOne, ReplaceWith: SearchFor + " Test...");
It has many options, but for replacing the text I used:
FindText - the text to find.
Replace - how many replacements to make (a WdReplace value).
ReplaceWith - the text.

I suspect that there is some metadata at the beginning of the file that is ignored when you look up the required index here:
DocText.IndexOf(SearchFor) + SearchFor.Length;
You can add the discrepancy to the insert position, provided the amount is constant for all files:
//Search index to insert text
InsertIndex = DocText.IndexOf(SearchFor) + SearchFor.Length + 50;
However, the most elegant solution would be to stop using rng.InsertAfter and instead set rng.Text like this:
//Get document text
DocText = rng.Text;
//Search index to insert text
InsertIndex = DocText.IndexOf(SearchFor) + SearchFor.Length;
//Update the text
DocText = DocText.Substring(0, InsertIndex) + "Test..." + DocText.Substring(InsertIndex);
// Set the new text
rng.Text = DocText;

Well, unfortunately the solution from EyIM does not work, as the replacement text got longer and is limited to 255 characters. Any other solutions? I really want to insert text, because the document contains formatting that I need to keep.

Related

C# Word Document Replace PlainText with mergefield

I've got a Word document template and a CSV I would like to mail merge it with.
In the Word document, the text I want to use for the mail merge is surrounded with <<>>, and it matches the headers in my CSV. For example, I have <<Salutation>> in my Word document and the field name Salutation in my CSV.
Is there an easy way to replace the text surrounded by <<>> with a mail merge field corresponding to its header in the CSV?
The code I have so far for reading the data in is:
Microsoft.Office.Interop.Word.Application _wordApp = new Microsoft.Office.Interop.Word.Application();
Microsoft.Office.Interop.Word.Document oDoc = _wordApp.Documents.Add(@"C:\Eyre\Template.docx");
_wordApp.Visible = true;
oDoc.MailMerge.MainDocumentType = Microsoft.Office.Interop.Word.WdMailMergeMainDocType.wdFormLetters;
oDoc.MailMerge.OpenDataSource(@"C:\Eyre\CSV.csv", false, false, true);
oDoc.MailMerge.Destination = Microsoft.Office.Interop.Word.WdMailMergeDestination.wdSendToNewDocument;
oDoc.MailMerge.Execute(false);
Microsoft.Office.Interop.Word.Document oLetters = _wordApp.ActiveDocument;
oLetters.SaveAs2(@"C:\Eyre\letters.docx",
Microsoft.Office.Interop.Word.WdSaveFormat.wdFormatDocumentDefault);
Any help would be much appreciated.
---EDIT---
This seems to be confusing some people. I have a Word template with plain text such as Salutation and need a C# program that replaces this plain text with a merge field from a CSV.

Here's a C# version of code to replace "placeholders" in a Word document with merge fields. (For readers looking for a VB-version, see https://stackoverflow.com/a/50159375/3077495.)
My code uses an already running instance of Word, so the part that interests you starts at foreach (Word.MailMergeDataField...
The Find/Replace actions are in their own procedure ReplaceTextWithMergeField, to which the name of the data source field (as Word sees it!), and the target Range for the search are passed.
Note how the angled bracket pairs are appended to the data field name in this procedure.
The Find/Replace actions are standard. Resetting the Range object for continuing the search for a data field name is a bit different, because it's necessary to get the position outside the merge field: after inserting the field, the Range is inside the field code. If this isn't done, Find could end up in the same field "infinitely". (Note: Not in this case, with the double angled brackets. But if anyone were to use the code without them, then the problem would occur.)
EDIT: In order to find and replace in Shape objects, those objects must be looped separately. Anything formatted with text wrapping is in a different layer of the document and is not part of Document.Content. I've adapted the find procedure in a third procedure to search through the document's ShapeRange, testing for Text Boxes, specifically.
private void btnDataToMergeFields_Click(object sender, EventArgs e)
{
getWordInstance();
if (wdApp != null)
{
if (wdApp.Documents.Count > 0)
{
Word.Document doc = wdApp.ActiveDocument;
Word.Range rng = doc.Content;
Word.ShapeRange rngShapes = rng.ShapeRange;
if (doc.MailMerge.MainDocumentType != Word.WdMailMergeMainDocType.wdNotAMergeDocument)
foreach (Word.MailMergeDataField mmDataField in doc.MailMerge.DataSource.DataFields)
{
System.Diagnostics.Debug.Print(ReplaceTextWithMergeField(mmDataField.Name, ref rng).ToString()
+ " merge fields inserted for " + mmDataField.Name);
rng = doc.Content;
System.Diagnostics.Debug.Print(ReplaceTextWithMergeFieldInShapes(mmDataField.Name, ref rngShapes)
+ " mergefields inserted for " + mmDataField.Name);
}
}
}
}
//returns the number of times the merge field was inserted
public int ReplaceTextWithMergeField(string sFieldName, ref Word.Range oRng)
{
int iFieldCounter = 0;
Word.Field fldMerge;
bool bFound;
oRng.Find.ClearFormatting();
oRng.Find.Forward = true;
oRng.Find.Wrap = Word.WdFindWrap.wdFindStop;
oRng.Find.Format = false;
oRng.Find.MatchCase = false;
oRng.Find.MatchWholeWord = false;
oRng.Find.MatchWildcards = false;
oRng.Find.MatchSoundsLike = false;
oRng.Find.MatchAllWordForms = false;
oRng.Find.Text = "<<" + sFieldName + ">>";
bFound = oRng.Find.Execute();
while (bFound)
{
iFieldCounter = iFieldCounter + 1;
fldMerge = oRng.Fields.Add(oRng, Word.WdFieldType.wdFieldMergeField, sFieldName, false);
oRng = fldMerge.Result;
oRng.Collapse(Word.WdCollapseDirection.wdCollapseEnd);
oRng.MoveStart(Word.WdUnits.wdCharacter, 2);
oRng.End = oRng.Document.Content.End;
oRng.Find.Text = "<<" + sFieldName + ">>";
bFound = oRng.Find.Execute();
}
return iFieldCounter;
}
public int ReplaceTextWithMergeFieldInShapes(string sFieldName,
ref Word.ShapeRange oRng)
{
int iFieldCounter = 0;
Word.Field fldMerge;
bool bFound;
foreach (Word.Shape shp in oRng)
{
if (shp.Type == Office.MsoShapeType.msoTextBox)
{
Word.Range rngText = shp.TextFrame.TextRange;
rngText.Find.ClearFormatting();
rngText.Find.Forward = true;
rngText.Find.Wrap = Word.WdFindWrap.wdFindStop;
rngText.Find.Format = false;
rngText.Find.MatchCase = false;
rngText.Find.MatchWholeWord = false;
rngText.Find.MatchWildcards = false;
rngText.Find.MatchSoundsLike = false;
rngText.Find.MatchAllWordForms = false;
rngText.Find.Text = "<<" + sFieldName + ">>";
bFound = rngText.Find.Execute();
while (bFound)
{
iFieldCounter = iFieldCounter + 1;
fldMerge = rngText.Fields.Add(rngText, Word.WdFieldType.wdFieldMergeField, sFieldName, false);
rngText = fldMerge.Result;
rngText.Collapse(Word.WdCollapseDirection.wdCollapseEnd);
rngText.MoveStart(Word.WdUnits.wdCharacter, 2);
rngText.End = shp.TextFrame.TextRange.End;
rngText.Find.Text = "<<" + sFieldName + ">>";
bFound = rngText.Find.Execute();
}
}
}
return iFieldCounter;
}

There are a number of approaches depending on your broader requirements. If you will run the tool as needed for simple / small tasks on your Windows machine, then the VBA/macro approach is probably best since you already have the things in place you need.
Another approach requires more coding and an understanding of DOCX, but you could potentially scale it and run it on machines without MS Office libraries. Since DOCX is open and text based, you can unzip it, process the XML contents and re-zip. There are some gotchas because the XML is not trivial. If you were doing this, using Word merge fields is better (for the programmer) than plain text, since finding fields is simpler. Plain text is better for the person working with the document/template, since they don't have to deal with merge fields, but the downside is that the XML processing can become much more complicated. The text <<Salutation>> in the template might not be easy to find in the XML: it could be split into pieces.
Another solution is to use something like Docmosis (a commercial product - please note I work for Docmosis). The upsides are that Docmosis can do the replacement and more complex requirements (conditional and looping structures, PDF conversion for example). The downside is you have to learn the API and install the software (or call out to the cloud) and also get your data into a format to pass to the engine.
I hope that helps.

Writing Japanese string to excel using OpenXml

I am trying to create an Excel file by reading data from the database. One of the columns contains Japanese text. Writing that column to an Excel cell and saving the workbook gives the following error (which makes sense, as the characters are not valid XML characters): '', hexadecimal value 0x0B, is an invalid character.
I am writing the string to the Excel cell as follows, using the DocumentFormat.OpenXml package.
var excelCell = new Cell();
var cellValue = dtRow[col.Name].ToString();
var inlineStr = new InlineString(new Text(cellValue));
excelCell.DataType = CellValues.InlineString;
excelCell.InlineString = inlineStr;
What needs to be done to write Japanese characters to Excel using OpenXml in C#?

OK, found the right way. Posting it as an answer so that it can be helpful.
To add text to Excel that is not allowed as valid XML, add the text as a SharedString to the SharedStringTable:
var index = InsertSharedStringItem(text, shareStringPart);
excelCell.CellValue = new CellValue(index.ToString());
excelCell.DataType = new EnumValue<CellValues>(CellValues.SharedString);
private static int InsertSharedStringItem(string text, SharedStringTablePart shareStringPart)
{
// If the part does not contain a SharedStringTable, create one.
if (shareStringPart.SharedStringTable == null)
{
shareStringPart.SharedStringTable = new SharedStringTable();
}
int i = 0;
// Iterate through all the items in the SharedStringTable. If the text already exists, return its index.
foreach (SharedStringItem item in shareStringPart.SharedStringTable.Elements<SharedStringItem>())
{
if (item.InnerText == text)
{
return i;
}
i++;
}
// The text does not exist in the part. Create the SharedStringItem and return its index.
shareStringPart.SharedStringTable.AppendChild(new SharedStringItem(new DocumentFormat.OpenXml.Spreadsheet.Text(text)));
shareStringPart.SharedStringTable.Save();
return i;
}
Full documentation for adding text as shared string to excel using OpenXml
https://msdn.microsoft.com/en-us/library/office/cc861607.aspx

How to Extract Docx Pages according to page range using c#

Hi, please help me extract pages from a docx file according to a page range like 2 - 4 or 10 - 15. I am using the code below, but it is not extracting correctly. Please show me where I need to change the code.
public void docx( string path,int pageStart,int pageend)
{
var app = new Application();
app.Visible = true;
var doc = app.Documents.Open(path);
//This Range object will contain each page.
var page = doc.Range(pageStart, pageend);
if (pageStart < pageend)
{
page.End = page.GoTo(What: WdGoToItem.wdGoToPage, Which: WdGoToDirection.wdGoToAbsolute, Count: pageStart + pageend).Start - pageStart;
}
else
{
page.End = doc.Range().End;
}
//Copy and paste the contents of the Range into a new document
page.Copy();
var doc2 = app.Documents.Add();
doc2.Range().Paste();
}

This works for me
var range = doc.Range();
range.Start = doc.GoTo(WdGoToItem.wdGoToPage, WdGoToDirection.wdGoToAbsolute, pageStart).Start;
if (pageend < doc.ComputeStatistics(WdStatistic.wdStatisticPages, false))
{
range.End = doc.GoTo(WdGoToItem.wdGoToPage, WdGoToDirection.wdGoToAbsolute, pageend + 1).End - 1;
}
range.Copy();
The new range selects the entire document, so its End is already the document's end. The start is set to the start of the first page you need. The end is set to the beginning of page (pageend + 1), minus 1 character (to step back). This brings us to the end of page pageend. This is only done if pageend is not the last page.
We could fit it all inside the range initialization, but that will make the code unreadable.

C# get text from file between two hashes

In my C# program (at this point) I have two fields in my form. One is a word list using a listbox; the other is a textbox. I have been able to successfully load a large word list into the listbox from a text file. I can also display the selected item in the listbox into the textbox this way:
private void wordList_SelectedIndexChanged(object sender, EventArgs e)
{
string word = wordList.Text;
concordanceDisplay.Text = word;
}
I have another local file I need to get at to display some of its contents in the textbox. In this file each headword (as in a dictionary) is preceded by a #. So, I would like to take the variable 'word' and search in this local file to put the entries into the textbox, like so:
#headword1
entry is here...
...
...
#headword2
entry is here...
...
...
#headword3
entry is here...
...
...
You get the format of the text file. I just need to search for the correct headword with # before that word, and copy all info from there until the next hash in the file, and place it in the text box.
Obviously, I am a newbie, so be gentle. Thanks much.
P.S. I used StreamReader to get at the word list and display it in the listbox like so:
StreamReader sr = new StreamReader("C:\\...\\list-final.txt");
string line;
while ((line = sr.ReadLine()) != null)
{
MyList.Add(line);
}
wordList.DataSource = MyList;

var sectionLines = File.ReadAllLines(fileName) // shortcut to read all lines from file
.SkipWhile(l => l != "#headword2") // skip everything before the heading you want
.Skip(1) // skip the heading itself
.TakeWhile(l => !l.StartsWith("#")) // grab stuff until the next heading or the end
.ToList(); // optional convert to list

string getSection(string sectionName)
{
StreamReader sr = new StreamReader(@"C:\Path\To\file.txt");
string line;
var MyList = new List<string>();
bool inCorrectSection = false;
while ((line = sr.ReadLine()) != null)
{
if (line.StartsWith("#"))
{
if (inCorrectSection)
break;
else
inCorrectSection = Regex.IsMatch(line, @"^#" + sectionName + @"($| -)");
}
else if (inCorrectSection)
MyList.Add(line);
}
return string.Join(Environment.NewLine, MyList);
}
// in another method
textBox.Text = getSection("headword1");
Here are a few alternate ways to check if the section matches, in rough order of how accurate they are in detecting the right section name:
// if the separator after the section name is always " -", this is the best way I've thought of, since it will work regardless of what's in the sectionName
inCorrectSection = Regex.IsMatch(line, @"^#" + sectionName + @"($| -)");
// as long as the section name can't contain # or spaces, this will work
inCorrectSection = line.Split('#', ' ')[1] == sectionName;
// as long as only alphanumeric characters can ever make up the section name, this is good
inCorrectSection = Regex.IsMatch(line, @"^#" + sectionName + @"\b");
// the problem with this is that if you are searching for "head", it will find "headOther" and think it's a match
inCorrectSection = line.StartsWith("#" + sectionName);

How do I grab the text and formatting from a PDF table with PDFBox? [duplicate]

I need to parse a PDF file which contains tabular data. I'm using PDFBox to extract the file text to parse the result (String) later. The problem is that the text extraction doesn't work as I expected for tabular data. For example, I have a file which contains a table like this (7 columns: the first two always have data, only one Complexity column has data, only one Financing column has data):
+-----+-------+--------------------------------+-----------------+
| AIH | Value |           Complexity           |    Financing    |
|     |       | Medium | High | Not applicable | MAC/Other | FAE |
+-----+-------+--------+------+----------------+-----------+-----+
| xyz | 12.43 | 12.34  |      |                | 12.34     |     |
+-----+-------+--------+------+----------------+-----------+-----+
| abc | 1.56  |        | 1.56 |                |           | 1.56|
+-----+-------+--------+------+----------------+-----------+-----+
Then I use PDFBox:
PDDocument document = PDDocument.load(pathToFile);
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(document);
Those two lines of data would be extracted like this:
xyz 12.43 12.4312.43
abc 1.56 1.561.56
There are no white spaces between the last two numbers, but this is not the biggest problem. The problem is that I don't know what the last two numbers mean: Medium, High, Not applicable? MAC/Other, FAE? I don't have the relation between the numbers and their columns.
It is not required for me to use the PDFBox library, so a solution that uses another library is fine. What I want is to be able to parse the file and know what each parsed number means.

You will need to devise an algorithm to extract the data in a usable format. Regardless of which PDF library you use, you will need to do this. Characters and graphics are drawn by a series of stateful drawing operations, i.e. move to this position on the screen and draw the glyph for character 'c'.
I suggest that you extend org.apache.pdfbox.pdfviewer.PDFPageDrawer and override the strokePath method. From there you can intercept the drawing operations for horizontal and vertical line segments and use that information to determine the column and row positions for your table. Then it's a simple matter of setting up text regions and determining which numbers/letters/characters are drawn in which region. Since you know the layout of the regions, you'll be able to tell which column the extracted text belongs to.
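As a rough illustration of that last step (this is plain Java, not PDFBox code): once the x-positions of the vertical lines are known, mapping a character's x-coordinate to a column is a lookup in a sorted array. A minimal sketch, assuming the line positions were already collected, for example from an overridden strokePath; the class name ColumnLocator and the sample coordinates are made up for the example.

```java
import java.util.Arrays;

// Hypothetical helper (not part of PDFBox): maps an x-coordinate to a column
// index, given the x-positions of the table's vertical lines.
class ColumnLocator {
    private final double[] boundaries; // sorted left to right

    ColumnLocator(double[] verticalLineXs) {
        this.boundaries = verticalLineXs.clone();
        Arrays.sort(this.boundaries);
    }

    // Returns the zero-based column containing x, or -1 if x is outside the table.
    int columnOf(double x) {
        int last = boundaries.length - 1;
        if (boundaries.length < 2 || x < boundaries[0] || x >= boundaries[last]) {
            return -1;
        }
        int pos = Arrays.binarySearch(boundaries, x);
        // Exact hit on a line: the column that starts at that line.
        // Otherwise binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos : -pos - 2;
    }

    public static void main(String[] args) {
        // Assumed sample data: four vertical lines, hence three columns.
        ColumnLocator locator = new ColumnLocator(new double[] {50, 110, 170, 230});
        System.out.println(locator.columnOf(120)); // prints 1
    }
}
```

The same lookup applied to the y-positions of the horizontal lines would give the row index.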
Also, the reason you may not have spaces between text that is visually separated is that very often a space character is not drawn by the PDF. Instead, the text matrix is updated and a 'move' drawing command is issued so that the next character is drawn a "space width" apart from the last one.
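To illustrate the point about missing spaces: if you have each glyph's start x and width, you can re-insert spaces wherever the horizontal gap is large enough. This is only a sketch under assumptions; GapSpacer, its parameters and the half-a-space threshold are hypothetical and not PDFBox API.

```java
// Hypothetical helper: rebuilds a line of text from positioned glyphs and
// inserts a space wherever consecutive glyphs are visibly separated.
class GapSpacer {
    // glyphs[i] starts at startXs[i] and is widths[i] wide. A gap of at least
    // half of spaceWidth is treated as a word break (assumed threshold).
    static String rebuild(String[] glyphs, double[] startXs, double[] widths, double spaceWidth) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < glyphs.length; i++) {
            if (i > 0) {
                double previousEnd = startXs[i - 1] + widths[i - 1];
                double gap = startXs[i] - previousEnd;
                if (gap >= spaceWidth / 2) {
                    line.append(' ');
                }
            }
            line.append(glyphs[i]);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        String[] glyphs = {"a", "b", "c"};
        double[] xs = {0, 5, 20};
        double[] widths = {5, 5, 5};
        System.out.println(rebuild(glyphs, xs, widths, 4)); // prints "ab c"
    }
}
```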
Good luck.

You can extract text by area in PDFBox. See the ExtractByArea.java example file, in the pdfbox-examples artifact if you're using Maven. A snippet looks like
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition( true );
Rectangle rect = new Rectangle( 464, 59, 55, 5);
stripper.addRegion( "class1", rect );
stripper.extractRegions( page );
String string = stripper.getTextForRegion( "class1" );
The problem is getting the coordinates in the first place. I've had success extending the normal TextStripper, overriding processTextPosition(TextPosition text) and printing out the coordinates for each character and figuring out where in the document they are.
But there's a much simpler way, at least if you're on a Mac. Open the PDF in Preview, ⌘I to show the Inspector, choose the Crop tab and make sure the units are in Points, from the Tools menu choose Rectangular selection, and select the area of interest. If you select an area, the inspector will show you the coordinates, which you can round and feed into the Rectangle constructor arguments. You just need to confirm where the origin is, using the first method.

I had used many tools to extract tables from PDF files, but none of them worked for me.
So I implemented my own algorithm (its name is traprange) to parse tabular data in PDF files.
Following are some sample pdf files and results:
Input file: sample-1.pdf, result: sample-1.html
Input file: sample-4.pdf, result: sample-4.html
Visit my project page at traprange.

It may be too late for my answer, but I think this is not that hard. You can extend the PDFTextStripper class and override the writePage() and processTextPosition(...) methods. In your case I assume that the column headers are always the same. That means that you know the x-coordinate of each column heading and you can compare the x-coordinates of the numbers to those of the column headings. If they are close enough (you have to test to decide how close), then you can say that the number belongs to that column.
Another approach would be to intercept the "charactersByArticle" Vector after each page is written:
@Override
public void writePage() throws IOException {
super.writePage();
final Vector<List<TextPosition>> pageText = getCharactersByArticle();
//now you have all the characters on that page
//to do what you want with them
}
Knowing your columns, you can do your comparison of the x-coordinates to decide what column every number belongs to.
The reason you don't have any spaces between numbers is because you have to set the word separator string.
I hope this is useful to you or to others who might be trying similar things.
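The "close enough" comparison could look roughly like the following. This is a minimal sketch in plain Java, assuming the header x-coordinates have already been read from the page; HeaderMatcher, the tolerance value and the sample coordinates are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: assigns a value to the column whose header
// x-coordinate is closest, provided it lies within a tolerance.
class HeaderMatcher {
    private final Map<String, Double> headerXs = new LinkedHashMap<>();
    private final double tolerance;

    HeaderMatcher(double tolerance) {
        this.tolerance = tolerance;
    }

    void addHeader(String name, double x) {
        headerXs.put(name, x);
    }

    // Returns the name of the closest header within the tolerance, or null if none is close enough.
    String columnFor(double x) {
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, Double> header : headerXs.entrySet()) {
            double distance = Math.abs(header.getValue() - x);
            if (distance <= tolerance && distance < bestDistance) {
                best = header.getKey();
                bestDistance = distance;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        HeaderMatcher matcher = new HeaderMatcher(15.0); // tolerance found by testing
        matcher.addHeader("Medium", 200.0);
        matcher.addHeader("High", 260.0);
        matcher.addHeader("FAE", 460.0);
        System.out.println(matcher.columnFor(205.0)); // prints Medium
    }
}
```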

There's PDFLayoutTextStripper that was designed to keep the format of the data.
From the README:
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
public class Test {
public static void main(String[] args) {
String string = null;
try {
PDFParser pdfParser = new PDFParser(new FileInputStream("sample.pdf"));
pdfParser.parse();
PDDocument pdDocument = new PDDocument(pdfParser.getDocument());
PDFTextStripper pdfTextStripper = new PDFLayoutTextStripper();
string = pdfTextStripper.getText(pdDocument);
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
};
System.out.println(string);
}
}

I've had decent success with parsing text files generated by the pdftotext utility (sudo apt-get install poppler-utils).
File convertPdf() throws Exception {
File pdf = new File("mypdf.pdf");
String outfile = "mytxt.txt";
String proc = "/usr/bin/pdftotext";
ProcessBuilder pb = new ProcessBuilder(proc,"-layout",pdf.getAbsolutePath(),outfile);
Process p = pb.start();
p.waitFor();
return new File(outfile);
}

Try using TabulaPDF (https://github.com/tabulapdf/tabula). It is a very good library for extracting table content from a PDF file. It works very much as expected.
Good luck. :)

Extracting data from PDF is bound to be fraught with problems. Are the documents created through some kind of automatic process? If so, you might consider converting the PDFs to uncompressed PostScript (try pdf2ps) and seeing if the PostScript contains some sort of regular pattern which you can exploit.

I had the same problem reading a PDF file in which the data is in tabular format. After a regular parse using PDFBox, each row was extracted with a comma as a separator... losing the columnar position.
To resolve this I used PDFTextStripperByArea and using coordinates I extracted the data column by column for each row. This is provided that you have a fixed format pdf.
File file = new File("fileName.pdf");
PDDocument document = PDDocument.load(file);
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition( true );
Rectangle rect1 = new Rectangle( 50, 140, 60, 20 );
Rectangle rect2 = new Rectangle( 110, 140, 20, 20 );
stripper.addRegion( "row1column1", rect1 );
stripper.addRegion( "row1column2", rect2 );
List allPages = document.getDocumentCatalog().getAllPages();
PDPage firstPage = (PDPage)allPages.get( 2 );
stripper.extractRegions( firstPage );
System.out.println(stripper.getTextForRegion( "row1column1" ));
System.out.println(stripper.getTextForRegion( "row1column2" ));
Then row 2 and so on...
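For a fixed-format table, the rectangles for "row 2 and so on" do not have to be typed by hand. A minimal sketch, assuming every row has the same height and the column edges are known; RegionGrid and its parameters are hypothetical names, and only java.awt.Rectangle is used, so each entry could then be passed to stripper.addRegion(name, rect) as in the code above.

```java
import java.awt.Rectangle;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds one named rectangle per table cell.
class RegionGrid {
    // columnXs holds the left edge of every column plus the right edge of the
    // last one. Rows are assumed to have equal height, starting at firstRowY.
    static Map<String, Rectangle> build(int[] columnXs, int firstRowY, int rowHeight, int rowCount) {
        Map<String, Rectangle> regions = new LinkedHashMap<>();
        for (int r = 0; r < rowCount; r++) {
            for (int c = 0; c + 1 < columnXs.length; c++) {
                int width = columnXs[c + 1] - columnXs[c];
                String name = "row" + (r + 1) + "column" + (c + 1);
                regions.put(name, new Rectangle(columnXs[c], firstRowY + r * rowHeight, width, rowHeight));
            }
        }
        return regions;
    }

    public static void main(String[] args) {
        // Same coordinates as rect1 and rect2 above, extended to two rows.
        Map<String, Rectangle> regions = build(new int[] {50, 110, 130}, 140, 20, 2);
        System.out.println(regions.keySet());
        // prints [row1column1, row1column2, row2column1, row2column2]
    }
}
```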

You can use PDFBox's PDFTextStripperByArea class to extract text from a specific region of a document. You can build on this by identifying the region of each cell of the table. This isn't provided out of the box, but the example DrawPrintTextLocations class demonstrates how you can parse the bounding boxes of individual characters in a document (it would be great to parse bounding boxes of strings or paragraphs, but I haven't seen support in PDFBox for this - see this question). You can use this approach to group all touching bounding boxes and so identify distinct cells of a table. One way to do this is to maintain a set, boxes, of Rectangle2D regions and then, for each parsed character, find the character's bounding box as in DrawPrintTextLocations.writeString(String string, List<TextPosition> textPositions) and merge it with the existing contents.
Rectangle2D bounds = s.getBounds2D();
// Pad sides to detect almost touching boxes
Rectangle2D hitbox = bounds.getBounds2D();
final double dx = 1.0; // This value works for me, feel free to tweak (or add setter)
final double dy = 0.000; // Rows of text tend to overlap, so no need to extend
hitbox.add(bounds.getMinX() - dx , bounds.getMinY() - dy);
hitbox.add(bounds.getMaxX() + dx , bounds.getMaxY() + dy);
// Find all overlapping boxes
List<Rectangle2D> intersectList = new ArrayList<Rectangle2D>();
for(Rectangle2D box: boxes) {
if(box.intersects(hitbox)) {
intersectList.add(box);
}
}
// Combine all touching boxes and update
for(Rectangle2D box: intersectList) {
bounds.add(box);
boxes.remove(box);
}
boxes.add(bounds);
You can then pass these regions to PDFTextStripperByArea.
You can also go one step further and separate out the horizontal and vertical components of these regions, and so infer the regions of all the table's cells, regardless of whether they hold any content.
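One way to read "separating out the horizontal and vertical components" is to project the merged boxes onto each axis and merge overlapping intervals; every column range combined with every row range then gives a cell, including empty ones. The following is a minimal sketch of that idea, not the code from my gist; AxisProjection and its method names are made up for the example.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: infers column and row ranges from content boxes.
class AxisProjection {
    // Each result is a {min, max} pair on the x-axis.
    static List<double[]> columnRanges(List<Rectangle2D> boxes) {
        List<double[]> intervals = new ArrayList<>();
        for (Rectangle2D box : boxes) {
            intervals.add(new double[] {box.getMinX(), box.getMaxX()});
        }
        return merge(intervals);
    }

    // Each result is a {min, max} pair on the y-axis.
    static List<double[]> rowRanges(List<Rectangle2D> boxes) {
        List<double[]> intervals = new ArrayList<>();
        for (Rectangle2D box : boxes) {
            intervals.add(new double[] {box.getMinY(), box.getMaxY()});
        }
        return merge(intervals);
    }

    // Sorts intervals by start and merges those that overlap or touch.
    private static List<double[]> merge(List<double[]> intervals) {
        intervals.sort(Comparator.comparingDouble((double[] i) -> i[0]));
        List<double[]> merged = new ArrayList<>();
        for (double[] current : intervals) {
            double[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && current[0] <= last[1]) {
                last[1] = Math.max(last[1], current[1]);
            } else {
                merged.add(new double[] {current[0], current[1]});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Rectangle2D> boxes = new ArrayList<>();
        boxes.add(new Rectangle2D.Double(10, 10, 20, 5));  // x 10..30, y 10..15
        boxes.add(new Rectangle2D.Double(25, 40, 10, 5));  // x 25..35, y 40..45
        boxes.add(new Rectangle2D.Double(100, 10, 15, 5)); // x 100..115, y 10..15
        System.out.println(Arrays.toString(columnRanges(boxes).get(0))); // prints [10.0, 35.0]
    }
}
```

In this sample there are two column ranges and two row ranges, so four cells are inferred even though only three boxes hold content.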
I have had cause to perform these steps, and eventually wrote my own PDFTableStripper class using PDFBox. I've shared my code as a gist on GitHub. The main method gives an example of how the class can be used:
try (PDDocument document = PDDocument.load(new File(args[0])))
{
final double res = 72; // PDF units are at 72 DPI
PDFTableStripper stripper = new PDFTableStripper();
stripper.setSortByPosition(true);
// Choose a region in which to extract a table (here a 6"wide, 9" high rectangle offset 1" from top left of page)
stripper.setRegion(new Rectangle(
(int) Math.round(1.0*res),
(int) Math.round(1*res),
(int) Math.round(6*res),
(int) Math.round(9.0*res)));
// Repeat for each page of PDF
for (int page = 0; page < document.getNumberOfPages(); ++page)
{
System.out.println("Page " + page);
PDPage pdPage = document.getPage(page);
stripper.extractTable(pdPage);
for(int c=0; c<stripper.getColumns(); ++c) {
System.out.println("Column " + c);
for(int r=0; r<stripper.getRows(); ++r) {
System.out.println("Row " + r);
System.out.println(stripper.getText(r, c));
}
}
}
}

It is not required for me to use the PDFBox library, so a solution that uses another library is fine
Camelot and Excalibur
You may want to try Python library Camelot, an open source library for Python. If you are not inclined to write code, you may use the web interface Excalibur created around Camelot. You "upload" the document to a localhost web server, and "download" the result from this localhost server.
Here is an example from using this python code:
import camelot
tables = camelot.read_pdf('foo.pdf', flavor="stream")
tables[0].to_csv('foo.csv')
The input is a pdf containing this table:
Sample table from the PDF-TREX set
No help is provided to camelot; it works on its own by looking at the relative alignment of pieces of text. The result is returned in a csv file:
PDF table extracted from sample by camelot
"Rules" can be added to help camelot identify where the separator lines are in sophisticated tables:
Rule added in Excalibur. Source
GitHub:
Camelot: https://github.com/camelot-dev/camelot
Excalibur: https://github.com/camelot-dev/excalibur
The two projects are active.
Here is a comparison with other software (with test based on actual documents), Tabula, pdfplumber, pdftables, pdf-table-extract.
I want is to be able to parse the file and know what each parsed number means
You cannot do that automatically, as pdf is not semantically structured.
Book versus document
Pdf "documents" are unstructured from a semantic standpoint (it's like a notepad file). The pdf document gives instructions on where to print a text fragment, unrelated to other fragments of the same section. There is no separation between content (what to print, and whether this is a fragment of a title, a table or a footnote) and the visual representation (font, location, etc). Pdf is an extension of PostScript, which describes a Hello world! page this way:
%!PS
/Courier % font
20 selectfont % size
72 500 moveto % current location to print at
(Hello world!) show % add text fragment
showpage % print all on the page
(Wikipedia).
One can imagine what a table looks like with the same instructions.
We could say html is not clearer, however there is a big difference: Html describes the content semantically (title, paragraph, list, table header, table cell, ...) and associates the css to produce a visual form, hence content is fully accessible. In this sense, html is a simplified descendant of sgml which puts constraints to allow data processing:
Markup should describe a document's structure and other attributes
rather than specify the processing that needs to be performed, because
it is less likely to conflict with future developments.
exactly the opposite of PostScript/Pdf. SGML is used in publishing. Pdf doesn't embed this semantical structure, it carries only the css-equivalent associated to plain character strings which may not be complete words or sentences. Pdf is used for closed documents and now for the so-called workflow management.
After having experienced the uncertainty and difficulty of trying to extract data from pdf, it's clear pdf is not at all a solution for preserving a document's content for the future (even though Adobe has obtained a pdf standard from its peers).
What is actually preserved well is the printed representation, as the pdf was fully dedicated to this aspect when created. Pdf are nearly as dead as printed books.
When reusing the content matters, one must rely again on manual re-entering of data, like from a printed book (possibly trying to do some OCR on it). This is more and more true, as many pdf even prevent the use of copy-paste, introduce multiple spaces between words or produce unordered character gibberish when some "optimization" is done for web use.
When the content of the document, not its printed representation, is valuable, then pdf is not the correct format. Even Adobe is unable to recreate perfectly the source of a document from its pdf rendering.
So open data should never be released in pdf format, this limits their use to reading and printing (when allowed), and makes reuse harder or impossible.

// Requires tabula-java; 'document' is an already loaded PDDocument.
ObjectExtractor oe = new ObjectExtractor(document);
SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm(); // Tabula algo.
Page page = oe.extract(1); // extract only the first page
// Run the table detection once instead of on every loop iteration
List<? extends Table> tables = sea.extract(page);
for (int t = 0; t < tables.size(); t++) {
    System.out.println("table: " + t);
    Table table = tables.get(t);
    for (int col = 0; col < table.getColCount(); col++) {
        for (int row = 0; row < table.getRowCount(); row++) {
            System.out.println("col:" + col + "/row:" + row + " >>" + table.getCell(row, col).getText());
        }
    }
}
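The loop above only prints the cells. If the goal is reusable data, one option is to collect the cell texts of each row and write them out as CSV. The sketch below is independent of Tabula: `CsvRows` is a hypothetical helper, and it assumes each row has already been collected into a list of strings. Quoting follows the usual CSV convention of doubling embedded quotes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CsvRows {
    // Quote a cell only when it contains a comma, a quote or a line break,
    // doubling any embedded quotes.
    static String escape(String cell) {
        if (cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    // Turn one row of cell texts into one CSV line (without line terminator).
    static String toCsvLine(List<String> cells) {
        List<String> escaped = new ArrayList<>();
        for (String cell : cells) {
            escaped.add(escape(cell == null ? "" : cell)); // empty cell for null
        }
        return String.join(",", escaped);
    }

    public static void main(String[] args) {
        // Made-up row; in practice the strings would come from the table cells.
        System.out.println(toCsvLine(Arrays.asList("Bolt, M8", "3", "0.50")));
    }
}
```

Extracted cell text often contains commas or line breaks (wrapped cells), which is why plain string concatenation with "," is not enough.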

How about printing to an image and doing OCR on that?
Sounds terribly inefficient, but since making text inaccessible is practically the very purpose of PDF, you gotta do what you gotta do.

The people at http://swftools.org/ have a pdf2swf component, which is also able to show tables.
They provide the source as well, so you could check it out.

This works fine with PDFBox 2.0.6 if the PDF file contains only rectangular tables, i.e. tables in which every row splits into the same number of whitespace-separated columns. It won't work with any other kind of table.
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFTableExtractor {
    public static void main(String[] args) {
        // Arguments: file path, start page, end page, number of columns in the rectangular table
        ArrayList<String[]> objTableList = readParaFromPDF("C:\\sample1.pdf", 1, 1, 6);
    }

    public static ArrayList<String[]> readParaFromPDF(String pdfPath, int pageNoStart, int pageNoEnd, int noOfColumnsInTable) {
        ArrayList<String[]> objArrayList = new ArrayList<>();
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(new File(pdfPath))) {
            if (!document.isEncrypted()) {
                PDFTextStripper tStripper = new PDFTextStripper();
                // Emit text in visual order so that the cells of a row stay on one line
                tStripper.setSortByPosition(true);
                tStripper.setStartPage(pageNoStart);
                tStripper.setEndPage(pageNoEnd);
                String pdfFileInText = tStripper.getText(document);
                // split into lines
                String[] documentLines = pdfFileInText.split("\\r?\\n");
                for (String line : documentLines) {
                    // split the line into columns on whitespace
                    String[] lineArr = line.split("\\s+");
                    // keep only the lines that have exactly the expected number of columns
                    if (lineArr.length == noOfColumnsInTable) {
                        for (String linedata : lineArr) {
                            System.out.print(linedata + " ");
                        }
                        System.out.println("");
                        objArrayList.add(lineArr);
                    }
                }
            }
        } catch (IOException e) {
            System.out.println("Exception " + e);
        }
        return objArrayList;
    }
}
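The limitation of the column-count heuristic is easier to see without a PDF at all. The sketch below applies the same idea to a plain string; `ColumnCountFilter` and the sample text are made up for illustration, and unlike the code above it trims each line first so that leading spaces do not create an empty first column.

```java
import java.util.ArrayList;
import java.util.List;

final class ColumnCountFilter {
    // Keep only the lines that split into exactly expectedColumns
    // whitespace-separated tokens.
    static List<String[]> rowsWithColumns(String pageText, int expectedColumns) {
        List<String[]> rows = new ArrayList<>();
        for (String line : pageText.split("\\r?\\n")) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length == expectedColumns) {
                rows.add(tokens);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String pageText = "Monthly report\n"
                + "Bolt 3 0.50\n"
                + "Hex nut 10 0.10\n" // "Hex nut" counts as two columns
                + "Washer 7 0.05\n";
        for (String[] row : rowsWithColumns(pageText, 3)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

With three expected columns the "Hex nut" row is silently dropped, because the space inside the cell makes it look like four columns. Conversely, any ordinary sentence that happens to have three words would be accepted as a table row.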

For anyone wanting to do the same thing as the OP (as I do): after days of research, Amazon Textract is the best option I found (if your volume is low, the free tier might be enough).

Consider using the PDFTableStripper class.
The class is available as a GitHub gist:
https://gist.github.com/beldaz/8ed6e7473bd228fcee8d4a3e4525be11#file-pdftablestripper-java-L1

I'm not familiar with PDFBox, but you could try looking at iText. Even though the homepage talks about PDF generation, it can also do PDF manipulation and extraction. Have a look and see if it fits your use case.

To read the content of a table from a PDF file, you only have to convert the PDF into text using any API (I used PdfTextExtractor.getTextFromPage() from iText) and then process that text in your Java program. Once the text is read, the major task is done; you then have to filter out the data you need. You can do this by repeatedly applying the split method of the String class until you reach the record you are interested in. Here is my code, which extracts part of the records from a PDF file and writes them into a .csv file. The URL of the PDF file is http://www.cea.nic.in/reports/monthly/generation_rep/actual/jan13/opm_02.pdf
Code:
public static void genrateCsvMonth_Region(String pdfpath, String csvpath) {
try {
String line = null;
// Make sure the CSV file exists (append mode creates it without truncating it)..
BufferedWriter writer1 = new BufferedWriter(new FileWriter(csvpath,
true));
writer1.close();
// Checking whether file is empty or not..
BufferedReader br = new BufferedReader(new FileReader(csvpath));
line = br.readLine();
br.close(); // release the file handle before reopening the file for writing
if (line == null) {
// File is empty: write the header row first..
BufferedWriter writer = new BufferedWriter(new FileWriter(
csvpath, true));
writer.append("REGION,");
writer.append("YEAR,");
writer.append("MONTH,");
writer.append("THERMAL,");
writer.append("NUCLEAR,");
writer.append("HYDRO,");
writer.append("TOTAL\n");
writer.close();
}
// Reading the pdf file..
PdfReader reader = new PdfReader(pdfpath);
BufferedWriter writer = new BufferedWriter(new FileWriter(csvpath,
true));
// Extracting records from page into String..
String page = PdfTextExtractor.getTextFromPage(reader, 1);
// Extracting month and Year from String..
String period1[] = page.split("PEROID");
String period2[] = period1[0].split(":");
String month[] = period2[1].split("-");
String period3[] = month[1].split("ENERGY");
String year[] = period3[0].split("VIS");
// Extracting Northen region
String northen[] = page.split("NORTHEN REGION");
String nthermal1[] = northen[0].split("THERMAL");
String nthermal2[] = nthermal1[1].split(" ");
String nnuclear1[] = northen[0].split("NUCLEAR");
String nnuclear2[] = nnuclear1[1].split(" ");
String nhydro1[] = northen[0].split("HYDRO");
String nhydro2[] = nhydro1[1].split(" ");
String ntotal1[] = northen[0].split("TOTAL");
String ntotal2[] = ntotal1[1].split(" ");
// Appending filtered data into CSV file..
writer.append("NORTHEN" + ",");
writer.append(year[0] + ",");
writer.append(month[0] + ",");
writer.append(nthermal2[4] + ",");
writer.append(nnuclear2[4] + ",");
writer.append(nhydro2[4] + ",");
writer.append(ntotal2[4] + "\n");
// Extracting Western region
String western[] = page.split("WESTERN");
String wthermal1[] = western[1].split("THERMAL");
String wthermal2[] = wthermal1[1].split(" ");
String wnuclear1[] = western[1].split("NUCLEAR");
String wnuclear2[] = wnuclear1[1].split(" ");
String whydro1[] = western[1].split("HYDRO");
String whydro2[] = whydro1[1].split(" ");
String wtotal1[] = western[1].split("TOTAL");
String wtotal2[] = wtotal1[1].split(" ");
// Appending filtered data into CSV file..
writer.append("WESTERN" + ",");
writer.append(year[0] + ",");
writer.append(month[0] + ",");
writer.append(wthermal2[4] + ",");
writer.append(wnuclear2[4] + ",");
writer.append(whydro2[4] + ",");
writer.append(wtotal2[4] + "\n");
// Extracting Southern Region
String southern[] = page.split("SOUTHERN");
String sthermal1[] = southern[1].split("THERMAL");
String sthermal2[] = sthermal1[1].split(" ");
String snuclear1[] = southern[1].split("NUCLEAR");
String snuclear2[] = snuclear1[1].split(" ");
String shydro1[] = southern[1].split("HYDRO");
String shydro2[] = shydro1[1].split(" ");
String stotal1[] = southern[1].split("TOTAL");
String stotal2[] = stotal1[1].split(" ");
// Appending filtered data into CSV file..
writer.append("SOUTHERN" + ",");
writer.append(year[0] + ",");
writer.append(month[0] + ",");
writer.append(sthermal2[4] + ",");
writer.append(snuclear2[4] + ",");
writer.append(shydro2[4] + ",");
writer.append(stotal2[4] + "\n");
// Extracting eastern region
String eastern[] = page.split("EASTERN");
String ethermal1[] = eastern[1].split("THERMAL");
String ethermal2[] = ethermal1[1].split(" ");
String ehydro1[] = eastern[1].split("HYDRO");
String ehydro2[] = ehydro1[1].split(" ");
String etotal1[] = eastern[1].split("TOTAL");
String etotal2[] = etotal1[1].split(" ");
// Appending filtered data into CSV file..
writer.append("EASTERN" + ",");
writer.append(year[0] + ",");
writer.append(month[0] + ",");
writer.append(ethermal2[4] + ",");
writer.append(" " + ",");
writer.append(ehydro2[4] + ",");
writer.append(etotal2[4] + "\n");
// Extracting northernEastern region
String neestern[] = page.split("NORTH");
String nethermal1[] = neestern[2].split("THERMAL");
String nethermal2[] = nethermal1[1].split(" ");
String nehydro1[] = neestern[2].split("HYDRO");
String nehydro2[] = nehydro1[1].split(" ");
String netotal1[] = neestern[2].split("TOTAL");
String netotal2[] = netotal1[1].split(" ");
writer.append("NORTH EASTERN" + ",");
writer.append(year[0] + ",");
writer.append(month[0] + ",");
writer.append(nethermal2[4] + ",");
writer.append(" " + ",");
writer.append(nehydro2[4] + ",");
writer.append(netotal2[4] + "\n");
writer.close();
} catch (IOException ioe) {
ioe.printStackTrace();
}
}
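The same split-on-label, then split-on-space pattern is repeated above for every region and every column, so it could be folded into one small helper. The sketch below is an assumption about how that might look, not a drop-in replacement: `LabelLookup` is a hypothetical class, the sample string and its numbers are invented, and it uses indexOf where the code above uses split, which gives the same token as long as the label does not occur again within the first few tokens. It also returns an empty string where the code above would throw an ArrayIndexOutOfBoundsException.

```java
final class LabelLookup {
    // Returns the token at tokenIndex among the space-separated tokens that
    // follow the first occurrence of label in section, or "" if the label
    // or the token is missing.
    static String tokenAfter(String section, String label, int tokenIndex) {
        int at = section.indexOf(label);
        if (at < 0) {
            return "";
        }
        String[] tokens = section.substring(at + label.length()).split(" ");
        return tokenIndex < tokens.length ? tokens[tokenIndex] : "";
    }

    public static void main(String[] args) {
        // Invented text standing in for one region's part of the page.
        String section = " THERMAL 1 2 3 45.6 NUCLEAR 7 8 9 10.1";
        // Index 4 mirrors the [4] used in the code above: the text after the
        // label starts with a space, so token 0 is the empty string.
        System.out.println(tokenAfter(section, "THERMAL", 4));
        System.out.println(tokenAfter(section, "NUCLEAR", 4));
        System.out.println(tokenAfter(section, "HYDRO", 4).isEmpty());
    }
}
```

With such a helper, a region without a NUCLEAR or HYDRO line yields an empty CSV field without special-case code, although the fixed token index remains tied to the layout of this one report.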
