C# OpenXML How to Replace \r\n with Break()? - c#

I have a text field in my database that contains multi-line text.
When I generate an MS Word document using OpenXML and bookmarks, the text becomes a single line.
I've noticed that the bookmark value contains the characters "\r\n" at each line break.
While looking for a solution I found some answers that helped, but I still have a problem.
I've used the run.Append(new Break()); solution, but the replaced text also shows the name of the bookmark.
For example:
bookmark test = "Big text here in first paragraph\r\nSecond paragraph".
It is shown in the MS Word document like this:
testBig text here in first paragraph
Second paragraph
Can anyone, please, help me to eliminate the bookmark name?
Here is my code:
public void UpdateBookmarksVistoria(string originalPath, string copyPath, string fileType)
{
    string wordmlNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    // Make a copy of the template file.
    File.Copy(originalPath, copyPath, true);
    // Open the document as an Open XML package and extract the main document part.
    using (WordprocessingDocument wordPackage = WordprocessingDocument.Open(copyPath, true))
    {
        MainDocumentPart part = wordPackage.MainDocumentPart;
        // Set up the namespace manager so you can perform XPath queries
        // to search for bookmarks in the part.
        NameTable nt = new NameTable();
        XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
        nsManager.AddNamespace("w", wordmlNamespace);
        // Load the part's XML into an XmlDocument instance.
        XmlDocument xmlDoc = new XmlDocument(nt);
        xmlDoc.Load(part.GetStream());
        // Get the URL used to display the photos.
        string url = HttpContext.Current.Request.Url.ToString();
        string enderecoURL;
        if (url.Contains("localhost"))
            enderecoURL = url.Substring(0, 26);
        else if (url.Contains("www."))
            enderecoURL = url.Substring(0, 24);
        else
            enderecoURL = url.Substring(0, 20);
        // Iterate through the bookmarks.
        int cont = 56;
        foreach (KeyValuePair<string, string> bookmark in bookmarks)
        {
            var res = from bm in part.Document.Body.Descendants<BookmarkStart>()
                      where bm.Name == bookmark.Key
                      select bm;
            var bk = res.SingleOrDefault();
            if (bk != null)
            {
                Run bookmarkText = bk.NextSibling<Run>();
                if (bookmarkText != null) // if the bookmark has text replace it
                {
                    var texts = bookmark.Value.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
                    for (int i = 0; i < texts.Length; i++)
                    {
                        if (i > 0)
                            bookmarkText.Append(new Break());
                        Text text = new Text();
                        text.Text = texts[i];
                        bookmarkText.Append(text); //HERE IS MY PROBLEM
                    }
                }
                else // otherwise append new text immediately after it
                {
                    var parent = bk.Parent; // bookmark's parent element
                    Text text = new Text(bookmark.Value);
                    Run run = new Run(new RunProperties());
                    run.Append(text);
                    // insert after bookmark parent
                    parent.Append(run);
                }
                bk.Remove(); // we don't want the bookmark anymore
            }
        }
        // Write the changes back to the document part.
        xmlDoc.Save(wordPackage.MainDocumentPart.GetStream(FileMode.Create));
        wordPackage.Close();
    }
}
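One likely cause, judging from the output shown in the question: the Run that follows the bookmark already contains a Text element holding the placeholder (here the bookmark name, "test"), and the loop appends new elements to that run without removing the old one. Below is a minimal sketch of one way around this, assuming the Open XML SDK (DocumentFormat.OpenXml) is referenced. ReplaceRunText is a hypothetical helper name, not part of the SDK.

```csharp
using System;
using DocumentFormat.OpenXml.Wordprocessing;

public static class BookmarkTextReplacer
{
    // Hypothetical helper: clears the run's existing text (the placeholder)
    // and writes the new value, inserting one Break per line boundary.
    public static void ReplaceRunText(Run run, string value)
    {
        // Remove the old Text and Break children. RunProperties are left
        // in place so the run keeps its formatting.
        run.RemoveAllChildren<Text>();
        run.RemoveAllChildren<Break>();

        // Split on both Windows and Unix line endings.
        string[] lines = value.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
        for (int i = 0; i < lines.Length; i++)
        {
            if (i > 0)
                run.Append(new Break());
            run.Append(new Text(lines[i]));
        }
    }
}
```

In the question's loop, the body of the if (bookmarkText != null) branch could then become a single call: BookmarkTextReplacer.ReplaceRunText(bookmarkText, bookmark.Value);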

Related

How to add a paragraph above a bookmark using open xml?

I found the solution. The code below adds a paragraph below the bookmark using OpenXML:
var mainPart = wDoc.MainDocumentPart;
var res = from bm in mainPart.Document.Body.Descendants<BookmarkStart>()
          where bm.Name == "vendos_tekstin"
          select bm;
var bookmark = res.SingleOrDefault();
if (bookmark != null)
{
    var parent = bookmark.Parent; // bookmark's parent element
    // build paragraph piece by piece
    Text text = new Text(DateTime.Now.ToString() + " , ");
    Text text1 = new Text(gjenerimi + " , ");
    Text text2 = new Text(merreshifren());
    var run = new Run();
    run.Append(text, text1, text2);
    Paragraph newParagraph = new Paragraph(run);
    // runProp is a RunProperties instance defined elsewhere in the original code
    run.PrependChild<RunProperties>(runProp);
    // insert after bookmark parent
    parent.InsertAfterSelf(newParagraph);
}

How to Preserve string with formatting in OpenXML Paragraph, Run, Text?

I am following this structure to add text from strings into OpenXML Runs, which are part of a Word document.
The string has new-line formatting and even paragraph indentation, but these all get stripped away when the text is inserted into a run. How can I preserve them?
Body body = wordprocessingDocument.MainDocumentPart.Document.Body;
String txt = "Some formatted string! \r\nLook there should be a new line here!\r\n\r\nAndthere should be 2 new lines here!";
// Add new text.
Paragraph para = body.AppendChild(new Paragraph());
Run run = para.AppendChild(new Run());
run.AppendChild(new Text(txt));
You need to use a Break in order to add new lines, otherwise they will just be ignored.
I've knocked together a simple extension method that will split a string on a new line and append Text elements to a Run with Breaks where the new lines were:
public static class OpenXmlExtension
{
    public static void AddFormattedText(this Run run, string textToAdd)
    {
        var texts = textToAdd.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
        for (int i = 0; i < texts.Length; i++)
        {
            if (i > 0)
                run.Append(new Break());
            Text text = new Text();
            text.Text = texts[i];
            run.Append(text);
        }
    }
}
This can be used like this:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(@"c:\somepath\test.docx", true))
{
    var body = wordDoc.MainDocumentPart.Document.Body;
    String txt = "Some formatted string! \r\nLook there should be a new line here!\r\n\r\nAndthere should be 2 new lines here!";
    // Add new text.
    Paragraph para = body.AppendChild(new Paragraph());
    Run run = para.AppendChild(new Run());
    run.AddFormattedText(txt);
}
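The extension method splits on Environment.NewLine, which is "\r\n" on Windows but "\n" on Unix-like systems, so text that arrives with a different line-ending convention than the host platform would not be split. A small sketch of a more tolerant split, using only the base class library (SplitLines is a hypothetical helper name):

```csharp
using System;

public static class NewlineSplitter
{
    // Hypothetical helper, not part of the Open XML SDK.
    // "\r\n" is listed first so a Windows line ending is treated as one
    // separator rather than as "\r" followed by "\n".
    public static string[] SplitLines(string text)
    {
        return text.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None);
    }
}
```

Inside AddFormattedText, the call to textToAdd.Split(...) could be swapped for NewlineSplitter.SplitLines(textToAdd). Empty entries are kept on purpose so that two consecutive line endings still yield two breaks.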

Detect if a word document is password protected before changing text inside

I am writing a method that goes into a directory, finds all .doc files, and substitutes specific text inside them. For this reason my method accepts three arguments:
The directory path
The string I want to substitute
The new replacement string
The issue I am facing occurs when I hit a password-protected document. I am not able to check whether the document is protected before I open it, so every time I open such a document I get a Word dialog window asking for the password. I want to check whether the document is protected and, if it is, just continue with the foreach.
This is my code:
private static void ReplaceString(string folderPath, string findText, string replaceText)
{
    // retrieve all doc files from the specified directory
    var wordFiles = Directory.GetFiles(folderPath, "*.doc", SearchOption.AllDirectories);
    var filtered = wordFiles.Where(f => !f.Contains('$'));
    foreach (var wordFilePath in filtered)
    {
        Console.WriteLine(wordFilePath);
        FileInfo fi = new FileInfo(wordFilePath);
        // var wordDocument = new Document();
        // check whether the current file is in use, read-only, or password protected
        if (IsLocked(fi))
        {
            continue;
        }
        // start a new Word application
        var wordApplication = new Application { Visible = false };
        // open the Word document
        Document wordDocument = null;
        // I want to catch here if the document is protected, just to continue forward
        try
        {
            wordDocument = wordApplication.Documents.Open(wordFilePath, ReadOnly: false, ConfirmConversions: false);
        }
        catch (COMException e)
        {
            continue;
        }
        // unfold all fields in the document (same as ALT + F9)
        wordDocument.ActiveWindow.View.ShowFieldCodes = true;
        // use the Range class to cover the whole document
        var range = wordDocument.Range();
        try
        {
            // replace the searched text
            range.Find.Execute(FindText: findText, Replace: WdReplace.wdReplaceAll, ReplaceWith: replaceText);
        }
        catch (COMException e)
        {
            continue;
        }
        // replace the searched text inside shapes
        var shapes = wordDocument.Shapes;
        foreach (Shape shape in shapes)
        {
            var initialText = shape.TextFrame.TextRange.Text;
            var resultingText = initialText.Replace(findText, replaceText);
            shape.TextFrame.TextRange.Text = resultingText;
        }
        // show original fields without codes
        wordDocument.ActiveWindow.View.ShowFieldCodes = false;
        // save and close the current document
        wordDocument.Save();
        wordDocument.Close();
        wordApplication.NormalTemplate.Saved = true;
        wordApplication.Quit();
        // release the Word application from memory
        Marshal.ReleaseComObject(wordApplication);
    }
}
The Documents.Open() method has a PasswordDocument parameter. If you pass it an arbitrary (wrong) password, the method simply ignores it when the document is not password protected. If the document is password protected, the method throws error 5408, which you can catch.
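A rough sketch of how that check might look. It assumes the Word error number (5408, as quoted above) is carried in the low 16 bits of the COMException's HRESULT, which is the usual layout for Office automation errors. WordPasswordCheck and IsPasswordError are hypothetical names, and the dummy password in the usage comment is arbitrary.

```csharp
using System;
using System.Runtime.InteropServices;

public static class WordPasswordCheck
{
    // Word error number quoted in the answer above.
    private const int WrongPasswordError = 5408;

    // Hypothetical helper: assumes the low 16 bits of the HRESULT hold
    // the Word error number.
    public static bool IsPasswordError(COMException ex)
    {
        return (ex.ErrorCode & 0xFFFF) == WrongPasswordError;
    }
}

// Usage inside the foreach loop (requires Microsoft.Office.Interop.Word):
//
// try
// {
//     wordDocument = wordApplication.Documents.Open(wordFilePath,
//         ReadOnly: false, ConfirmConversions: false,
//         PasswordDocument: "not-the-real-password");
// }
// catch (COMException e) when (WordPasswordCheck.IsPasswordError(e))
// {
//     wordApplication.Quit();   // avoid leaving a hidden Word process behind
//     continue;                 // password protected: skip this file
// }
```

Filtering on the error number keeps other COM failures (a corrupt file, for example) from being silently treated as "password protected".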

Extract text by line from PDF using iTextSharp c#

I need to run some analysis by extracting data from a PDF document.
Using iTextSharp, I called the PdfTextExtractor.GetTextFromPage method to extract the contents of a PDF document, and it returned everything as a single long line.
Is there a way to get the text line by line so that I can store the lines in an array? Analyzing the data by line would be more flexible.
Below is the code I used:
string urlFileName1 = "pdf_link";
PdfReader reader = new PdfReader(urlFileName1);
string text = string.Empty;
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    text += PdfTextExtractor.GetTextFromPage(reader, page);
}
reader.Close();
candidate3.Text = text.ToString();
public void ExtractTextFromPdf(string path)
{
    using (PdfReader reader = new PdfReader(path))
    {
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            // Use a fresh strategy for each page; a reused instance keeps
            // the text chunks collected from earlier pages.
            ITextExtractionStrategy strategy = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
            string page = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
            string[] lines = page.Split('\n');
            foreach (string line in lines)
            {
                MessageBox.Show(line);
            }
        }
    }
}
I know this is an older post, but I spent a lot of time trying to figure this out, so I'm sharing it for anyone who finds this through a search later:
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PDFApp2
{
    class Program
    {
        static void Main(string[] args)
        {
            string filePath = @"Your said path\the file name.pdf";
            string outPath = @"the output said path\the text file name.txt";
            int pagesToScan = 2;
            string strText = string.Empty;
            try
            {
                PdfReader reader = new PdfReader(filePath);
                for (int page = 1; page <= pagesToScan; page++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in a PDF
                {
                    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
                    strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
                    strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
                    //creating the string array and storing the PDF line by line
                    string[] lines = strText.Split('\n');
                    foreach (string line in lines)
                    {
                        //Creating and appending to a text file
                        using (System.IO.StreamWriter file = new System.IO.StreamWriter(outPath, true))
                        {
                            file.WriteLine(line);
                        }
                    }
                }
                reader.Close();
            }
            catch (Exception ex)
            {
                Console.Write(ex);
            }
        }
    }
}
I had the program read in a PDF, from a set path, and just output to a text file, but you can manipulate that to anything. This was building off of Snziv Gupta's response.
All the other code samples here didn't work for me, probably due to changes to the itext7 API.
This minimal example here works ok:
var pdfReader = new iText.Kernel.Pdf.PdfReader(fileName);
var pdfDocument = new iText.Kernel.Pdf.PdfDocument(pdfReader);
var contents = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(pdfDocument.GetFirstPage());
LocationTextExtractionStrategy will automatically insert '\n' in the output text. However, sometimes it will insert '\n' where it shouldn't.
In that case you need to build a custom TextExtractionStrategy or RenderListener. Basically, the code that detects a new line is this method:
public virtual bool SameLine(ITextChunkLocation other) {
    return OrientationMagnitude == other.OrientationMagnitude &&
           DistPerpendicular == other.DistPerpendicular;
}
In some cases '\n' shouldn't be inserted when there is only a small difference between DistPerpendicular and other.DistPerpendicular, so you need to change the comparison to something like Math.Abs(DistPerpendicular - other.DistPerpendicular) < 10.
Alternatively, you can put that comparison in the RenderText method of your custom TextExtractionStrategy/RenderListener class.
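To illustrate the tolerance idea independently of iTextSharp, here is a small sketch that groups text chunks into lines when their baselines differ by less than a threshold. Chunk and LineGrouper are hypothetical types made up for this example (Y stands in for DistPerpendicular); they are not part of the iTextSharp API, and the tolerance value is something to tune per document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a text chunk: Y is the baseline position,
// X the horizontal start of the chunk.
public sealed class Chunk
{
    public string Text { get; }
    public double X { get; }
    public double Y { get; }
    public Chunk(string text, double x, double y) { Text = text; X = x; Y = y; }
}

public static class LineGrouper
{
    public static List<string> GroupIntoLines(IEnumerable<Chunk> chunks, double tolerance)
    {
        var lines = new List<List<Chunk>>();
        // PDF y coordinates grow upward, so descending Y is top to bottom.
        foreach (var chunk in chunks.OrderByDescending(c => c.Y))
        {
            var current = lines.Count > 0 ? lines[lines.Count - 1] : null;
            // Same line when the baseline is within the tolerance of the
            // first chunk of the current line.
            if (current != null && Math.Abs(current[0].Y - chunk.Y) < tolerance)
                current.Add(chunk);
            else
                lines.Add(new List<Chunk> { chunk });
        }
        // Within each line, order the chunks left to right.
        return lines
            .Select(line => string.Join(" ", line.OrderBy(c => c.X).Select(c => c.Text)))
            .ToList();
    }
}
```

With a tolerance of 10, chunks at baselines 700 and 698 end up on the same line, while a chunk at 680 starts a new one.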
Use LocationTextExtractionStrategy in lieu of SimpleTextExtractionStrategy. The text extracted by LocationTextExtractionStrategy contains a newline character at the end of each line.
ITextExtractionStrategy Strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
string pdftext = PdfTextExtractor.GetTextFromPage(reader,pageno, Strategy);
string[] words = pdftext.Split('\n');
return words;
Try
String page = PdfTextExtractor.getTextFromPage(reader, 2);
String s1[]=page.split("\n");

Reading PDF content with itextsharp dll in VB.NET or C#

How can I read PDF content with iTextSharp using the PdfReader class? My PDF may include plain text or images of text.
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System.IO;
using System.Text; // StringBuilder, Encoding
public string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();
    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            text.Append(currentText);
        }
        pdfReader.Close();
    }
    return text.ToString();
}
LGPL / FOSS iTextSharp 4.x
var pdfReader = new PdfReader(path); //other filestream etc
byte[] pageContent = pdfReader.GetPageContent(pageNum); //not zero based
byte[] utf8 = Encoding.Convert(Encoding.Default, Encoding.UTF8, pageContent);
string textFromPage = Encoding.UTF8.GetString(utf8);
None of the other answers were useful to me, they all seem to target the AGPL v5 of iTextSharp. I could never find any reference to SimpleTextExtractionStrategy or LocationTextExtractionStrategy in the FOSS version.
Something else that might be very useful in conjunction with this:
const string PdfTableFormat = @"\(.*\)Tj";
Regex PdfTableRegex = new Regex(PdfTableFormat, RegexOptions.Compiled);
List<string> ExtractPdfContent(string rawPdfContent)
{
    var matches = PdfTableRegex.Matches(rawPdfContent);
    var list = matches.Cast<Match>()
        .Select(m => m.Value
            .Substring(1) //remove leading (
            .Remove(m.Value.Length - 4) //remove trailing )Tj
            .Replace(@"\)", ")") //unencode parens
            .Replace(@"\(", "(")
            .Trim()
        )
        .ToList();
    return list;
}
This will extract the text-only data from the PDF. If the displayed text is Foo(bar), it is encoded in the PDF as (Foo\(bar\))Tj, and this method returns Foo(bar) as expected. The method strips out a lot of additional information, such as location coordinates, from the raw PDF content.
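One caveat with the greedy .* in that pattern: if two Tj operators sit on the same line of the content stream, the match runs from the first opening parenthesis to the last )Tj and returns them as one string. A possible variation, sketched below, consumes escaped characters as units and stops at the first unescaped closing parenthesis. TjTextExtractor is a hypothetical name, and the sketch still ignores other text operators (such as TJ) and unescaped nested parentheses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TjTextExtractor
{
    // One literal string operand followed by the Tj operator. A backslash
    // plus the following character is consumed as a unit, so "\)" does not
    // end the match early.
    private static readonly Regex TjRegex =
        new Regex(@"\(((?:\\.|[^\\)])*)\)Tj", RegexOptions.Compiled);

    public static List<string> Extract(string rawPdfContent)
    {
        return TjRegex.Matches(rawPdfContent)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value
                .Replace(@"\)", ")")   // unescape parentheses
                .Replace(@"\(", "(")
                .Trim())
            .ToList();
    }
}
```

Capturing the operand in a group also removes the need for the Substring/Remove arithmetic on the full match.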
Here is a VB.NET solution based on ShravankumarKumar's solution.
This will ONLY give you the text. The images are a different story.
Public Shared Function GetTextFromPDF(PdfFileName As String) As String
    Dim oReader As New iTextSharp.text.pdf.PdfReader(PdfFileName)
    Dim sOut = ""
    For i = 1 To oReader.NumberOfPages
        Dim its As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
        sOut &= iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its)
    Next
    Return sOut
End Function
In my case, I just wanted the text from a specific area of the PDF document so I used a rectangle around the area and extracted the text from it. In the sample below the coordinates are for the entire page. I don't have PDF authoring tools so when it came time to narrow down the rectangle to the specific location I took a few guesses at the coordinates until the area was found.
Rectangle _pdfRect = new Rectangle(0f, 0f, 612f, 792f); // Entire page - PDF coordinate system 0,0 is bottom left corner. 72 points / inch
RenderFilter _renderFilter = new RegionTextRenderFilter(_pdfRect);
ITextExtractionStrategy _strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), _renderFilter);
string _text = PdfTextExtractor.GetTextFromPage(_pdfReader, 1, _strategy);
As noted by the above comments the resulting text doesn't maintain any of the formatting found in the PDF document, however, I was happy that it did preserve the carriage returns. In my case, there were enough constants in the text that I was able to extract the values that I required.
Here is an improved version of ShravankumarKumar's answer. I created special classes for the pages so you can access words in the PDF by text row and by word position within that row.
using System.Collections.Generic;
using System.Linq;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
//create a list of pdf pages
var pages = new List<PdfPage>();
//load the pdf into the reader. NOTE: path can also be replaced with a byte array
using (PdfReader reader = new PdfReader(path))
{
    //loop all the pages and extract the text
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        pages.Add(new PdfPage()
        {
            content = PdfTextExtractor.GetTextFromPage(reader, i)
        });
    }
}
//use linq to create the rows and words by splitting on newline and space
pages.ForEach(x => x.rows = x.content.Split('\n').Select(y =>
    new PdfRow() {
        content = y,
        words = y.Split(' ').ToList()
    }
).ToList());
The custom classes
class PdfPage
{
    public string content { get; set; }
    public List<PdfRow> rows { get; set; }
}
class PdfRow
{
    public string content { get; set; }
    public List<string> words { get; set; }
}
Now you can get a word by row and word index.
string myWord = pages[0].rows[12].words[4];
Or use Linq to find the rows containing a specific word.
//find the rows in a specific page containing a word
var myRows = pages[0].rows.Where(x => x.words.Any(y => y == "myWord1")).ToList();
//find the rows in all pages containing a word
var myRowsAllPages = pages.SelectMany(r => r.rows).Where(x => x.words.Any(y => y == "myWord2")).ToList();
Public Sub PDFTxtToPdf(ByVal sTxtfile As String, ByVal sPDFSourcefile As String)
    Dim sr As StreamReader = New StreamReader(sTxtfile)
    Dim doc As New Document()
    PdfWriter.GetInstance(doc, New FileStream(sPDFSourcefile, FileMode.Create))
    doc.Open()
    doc.Add(New Paragraph(sr.ReadToEnd()))
    doc.Close()
End Sub
