Extract text from pdf by format - c#

I am trying to extract the headlines from pdfs.
Until now I tried to read the plain text and take the first line (which didn't work because in plain text the headlines were not at the beginning) and just read the text from a region (which didn't work, because the regions are not always the same).
The easiest way to do this is in my opinion to read just text with a special format (font, fontsize etc.).
Is there a way to do this?

You can enumerate all text objects on a PDF page using Docotic.Pdf library. For each of the text objects information about the font and the size of the object is available. Below is a sample
public static void listTextObjects(string inputPdf)
{
using (PdfDocument pdf = new PdfDocument(inputPdf))
{
string format = "{0}\n{1}, {2}px at {3}";
foreach (PdfPage page in pdf.Pages)
{
foreach (PdfPageObject obj in page.GetObjects())
{
if (obj.Type != PdfPageObjectType.Text)
continue;
PdfTextData text = (PdfTextData)obj;
string message = string.Format(format, text.Text, text.Font.Name,
text.Size.Height, text.Position);
Console.WriteLine(message);
}
}
}
}
The code will output lines like the following for each text object on each page of the input PDF file.
FACTUUR
Helvetica-BoldOblique, 19.04px at { X=51.12; Y=45.54 }
You can use the retrieved information to find largest text or bold text or text with other properties used to format the headline.
If your PDF is guaranteed to have headline as the topmost text on a page than you can use even simpler approach
public static void printText(string inputPdf)
{
using (PdfDocument pdf = new PdfDocument(inputPdf))
{
foreach (PdfPage page in pdf.Pages)
{
string text = page.GetTextWithFormatting();
Console.WriteLine(text);
}
}
}
The GetTextWithFormatting method returns text in the reading order (i.e from left top to right bottom position).
Disclaimer: I am one of the developer of the library.

Related

WPF RichTextBox: AppendText, TextRange and the trailing newline

I am reading a text file's contents into a RichTextBox like this:
string contents = File.ReadAllText("MyFile.txt");
myRichTextBox.Document.Blocks.Clear();
myRichTextBox.AppendText(contents);
I am using the RichTextBox to automatically apply some syntax highlighting of sorts. When I try reading the unformatted text as described here to save it back to the file, things happen:
A newline (\r\n) is added to the back of the file, which I don't want unless the user explicitly adds this newline.
When I load the file again, the newline is not displayed in the RichTextEdit, even if it is present in the file.
How can I change this, so that the RichTextBox displays and returns exactly the contents of the text file?
The newline \r\n (CR/LF) is part of the text formatting in the RichTextBox control. Each paragraph while converting to the text will be appended by the \r\n.
This is means when a user press the ENTER button a new paragraph with \r\n is adding to the RichTextBox control. And when StringFromRichTextBox() method, described in the Microsoft documentation is used to extract the text content from a RichTextBox it will return a string in which all paragraphs are separated by the \r\n.
The explanations regarding the comments above:
A newline (\r\n) is added to the back of the file, which I don't want unless the user explicitly adds this newline.
A newline \r\n is adding to the end of the file only as a part of the each paragraph ending.
NOTE: If it is necessary to save and thereafter to load the saved document the TextRange.Save() and TextRange.Load() methods can be used:
public void SaveRtf(RichTextBox rtb, string file)
{
var range = new TextRange(rtb.Document.ContentStart, rtb.Document.ContentEnd);
using (var stream = new StreamWriter(file))
{
range.Load(stream.BaseStream, DataFormats.Rtf);
}
}
public void LoadRtf(RichTextBox rtb, string file)
{
var range = new TextRange(rtb.Document.ContentStart, rtb.Document.ContentEnd);
using (var stream = new StreamWriter(file))
{
range.Save(stream.BaseStream, DataFormats.Rtf);
}
}
If to save the whole RuchTextBox content the new TextRange(rtb.Document.ContentStart, rtb.Document.ContentEnd).Text will be used than any text formatting after restoring will be lost.
Could this work? contents.Replace("\r\n", "\n");

Removing watermark in word with OpenXml & C# corrupts document

I have tried following code block to delete the watermark from the document
Code 1:
private static void DeleteCustomWatermark(WordprocessingDocument package, string watermarkId)
{
MainDocumentPart maindoc = package.MainDocumentPart;
if(maindoc!=null)
{
var headers = maindoc.GetPartsOfType<HeaderPart>();
if(headers!=null)
{
var head = headers.First(); //we are sure that this header part contains the Watermark with id=watermarkId
var watermark = head.GetPartById(watermarkId);
if(watermark!=null)
head.DeletePart(watermark);
}
}
}
Code 2:
public static void DeleteCustomWatermark(WordProcessingDocument package, string headerId)
{
//headerId is the id of the header section which contains the watermark
MainDocumentPart = maindoc = package.MainDocumentPart;
if(maindoc!=null)
{
var header = maindoc.HeaderParts.First(i=>maindoc.GetIdOfPart(i).Equals(headerId));
if(header!=null)
maindoc.DeletePart(header)
}
}
I have tried both the code blocks. it removes watermark but leaves the document corrupted. I need to recover after this. After recovery the docs are fine. But I want proper solution so that I can remove watermark with C# code without leaving the document corrupted. Please help.
Thanks
You also need to remove the "Picture" or "Drawing" in the header parts.
e.g.
List<Picture> pictures = new List<Picture>(headerPart.RootElement.Descendants<Picture>());
...
foreach(Picture p in pictures) {
p.Remove();
}
...
headerPart.DeleteParts(imagePartList);

iTextSharp How to read Table in PDF file

I am working on convert PDF to text. I can get text from PDF correctly but it is being complicated in table structure. I know PDF doesn't support table structure but I think there is a way get cells correctly. Well, for example:
I want to convert to text like this:
> This is first example.
> This is second example.
But, when I convert PDF to text, theese datas looking like this:
> This is This is
> first example. second example.
How can I get values correctly?
--EDIT:
Here is how did I convert PDF to Text:
OpenFileDialog ofd = new OpenFileDialog();
string filepath;
ofd.Filter = "PDF Files(*.PDF)|*.PDF|All Files(*.*)|*.*";
if (ofd.ShowDialog() == DialogResult.OK)
{
filepath = ofd.FileName.ToString();
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filepath);
for (int page = 1; page < reader.NumberOfPages; page++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
strText += s;
}
reader.Close();
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
}
To make my comment an actual answer...
You use the LocationTextExtractionStrategy for text extraction:
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
This strategy arranges all text it finds in left-to-right lines from top to bottom (actually also taking the text line angle into account). Thus, it clearly is not what you need to extract text from tables with cells with multi-line content.
Depending on the document in question there are different approaches one can take:
Use the iText SimpleTextExtractionStrategy if the text drawing operations in the document in question already are in the order one wants for text extraction.
Use a custom text extraction strategy which makes use of tagging information if the document tables are properly tagged.
Use a complex custom text extraction strategy which tries to get hints from text arrangements, line paths, or background colors to guess the table cell structure and extract text cell by cell.
In this case, the OP commented that he changed LocationTextExtractionStrategy with SimpleTextExtractionStrategy, then it worked.

PDF to Text: iTextSharp: Duplicate Pages in Extraction Results

Thanks in advance.
The Background:
I'm working on a console application that extracts data from specific sections in pdf documents. To do this I first need to convert that pdf into a string to work with. To do this I turned to iTextSharp. The pdfs are laid out with two columns per page so I'm using the SimpleTextExtractionStratgey() (I tried iTextSharp.text.pdf.parser.LocationTextExtractionStrategy(); but found it ineffective for the page layout).
Description of content being converted to text:
The pages I seem to be having trouble with have a "header" posted up on the side of the page. Pages with headers are intermittently dispersed through the document.
Image of page layout: http://postimg.org/image/b7i25v0g1/
The Problem:
It seems when it finishes looking through the columns on the page then moves on to that side header. It would then jump to the next page with a side header, convert that to text, then start again from the top of the page where the first header was encountered.
I'd end up with text that looks like:
Page 1 Content
First Header
Second Header
Page 1 Content
Page 2 Content
etc.
Here is the pdf: http://www.filedropper.com/dd35-completeadventurer
I'm not married to iTextSharp I just need a reliable way to convert documents with this format to text. A work around or alternate method would be appreciated.
static public string ToTxt(string #filePath)
{
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filePath);
for (int page = 1; page <= reader.NumberOfPages; page++)
{
Widgets.ProgressBar(page);
//Convert PDF to Text
ITextExtractionStrategy its = new SimpleTextExtractionStrategy(); //iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
strText = strText + s;
}
reader.Close();
Console.WriteLine("File Extracted");
}
catch (Exception e)
{
Console.WriteLine("Exception: " + e.Message);
}
finally
{
Console.Clear();
}
return strText;
}
As already conjectured in a comment, the duplicate text already is present in the PDF content!
Details
The page contents of pairs of pages facing each other in your document often are identical, each time the contents of the whole spread, and the individual pages merely display only the left or the right half respectively.
E.g. consider the two pages 6 and 7. Their contents are identical:
filling the area of their identical MediaBox. Merely by setting the CropBox (and the ArtBox, BleedBox, and TrimBox) to the left or right half respectively, only the expected content is shown for page 6:
and page 7:
Neither the iText(Sharp) parser framework nor the SimpleTextExtractionStrategy automatically restrict to these boxes, they extract all text drawn anywhere in the content. Thus, the duplicate text.
Preventing duplicate text in the extraction result
Knowing the cause for the text duplication, there are multiple ways to prevent it:
You can try and extract the content only of every other PDF page. Unfortunately the above said is not true for all pages, at least the initial pages (title page, contents, ...) are not created using the scheme explained above, and further into the book there are some artwork pages not following the scheme either. Thus, this option would require quite some management of exceptional pages.
You can extract the contents of each page but keep the contents of the previously processed page in some variable. Now only add the newly extracted content to the result if it does not equal the content of the prior page.
You can use the iText(Sharp) parser filters. If you restrict the text chunks processed by your strategy to only those drawn inside the crop box of the current page, you prevent duplicate text caused by off-page content. You can find an example filtering by region here: ExtractPageContentArea.java / ExtractPageContentArea.cs.
Using mkl's second method (checking each page for repeat) I came up with the following and it works brilliantly; an easy fix:
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filePath);
string prevPage = "";
for (int page = 1; page <= reader.NumberOfPages; page++)
{
Widgets.ProgressBar(page);
//Convert PDF to Text
ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
if (prevPage != s)
strText += s;
prevPage = s;
}
reader.Close();
Console.WriteLine("File Extracted");
}
catch (Exception e)
{
Console.WriteLine("Exception: " + e.Message);
}
finally
{
Console.Clear();
}
return strText;
}

Get plain text from an RTF text

I have on my database a column that holds text in RTF format.
How can I get only the plain text of it, using C#?
Thanks :D
Microsoft provides an example where they basically stick the rtf text in a RichTextBox and then read the .Text property... it feels somewhat kludgy, but it works.
static public string ConvertToText(string rtf)
{
using(RichTextBox rtb = new RichTextBox())
{
rtb.Rtf = rtf;
return rtb.Text;
}
}
for WPF you can use
(using Xceed WPF Toolkit) this extension method :
public static string RTFToPlainText(this string s)
{
// for information : default Xceed.Wpf.Toolkit.RichTextBox formatter is RtfFormatter
Xceed.Wpf.Toolkit.RichTextBox rtBox = new Xceed.Wpf.Toolkit.RichTextBox(new System.Windows.Documents.FlowDocument());
rtBox.Text = s;
rtBox.TextFormatter = new Xceed.Wpf.Toolkit.PlainTextFormatter();
return rtBox.Text;
}
If you want a pure code version, you can parse the rtf yourself and keep only the text bits. It's a bit of work, but not very difficult work - RTF files have a very simple syntax. Read about it in the RTF spec.

Categories