How to preserve string formatting in OpenXML Paragraph, Run, Text? - C#

I am following this structure to add text from strings into OpenXML Runs, which are part of a Word document.
The string contains new lines and even paragraph indentation, but these all get stripped away when the text is inserted into a Run. How can I preserve them?
Body body = wordprocessingDocument.MainDocumentPart.Document.Body;
String txt = "Some formatted string! \r\nLook there should be a new line here!\r\n\r\nAnd there should be 2 new lines here!";
// Add new text.
Paragraph para = body.AppendChild(new Paragraph());
Run run = para.AppendChild(new Run());
run.AppendChild(new Text(txt));

You need to use a Break in order to add new lines, otherwise they will just be ignored.
I've knocked together a simple extension method that will split a string on a new line and append Text elements to a Run with Breaks where the new lines were:
public static class OpenXmlExtension
{
    public static void AddFormattedText(this Run run, string textToAdd)
    {
        var texts = textToAdd.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
        for (int i = 0; i < texts.Length; i++)
        {
            // Insert a Break before every line except the first.
            if (i > 0)
                run.Append(new Break());
            Text text = new Text();
            text.Text = texts[i];
            run.Append(text);
        }
    }
}
This can be used like this:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(@"c:\somepath\test.docx", true))
{
    var body = wordDoc.MainDocumentPart.Document.Body;
    String txt = "Some formatted string! \r\nLook there should be a new line here!\r\n\r\nAnd there should be 2 new lines here!";
    // Add new text.
    Paragraph para = body.AppendChild(new Paragraph());
    Run run = para.AppendChild(new Run());
    run.AddFormattedText(txt);
}
This produces the following output in Word: the first two sentences on consecutive lines, then one empty line, then the last sentence.
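One caveat with the extension method above: it splits on Environment.NewLine, which is "\r\n" on Windows but "\n" on Linux and macOS, so a string that uses the other convention would not be split. A minimal sketch of a more tolerant split follows. NewlineSplitter and SplitLines are names I made up for illustration, not part of the Open XML SDK.

```csharp
using System;

public static class NewlineSplitter
{
    // Hypothetical helper, not an SDK type. "\r\n" is listed first so that a
    // Windows line ending counts as one separator instead of two.
    private static readonly string[] LineEndings = { "\r\n", "\n", "\r" };

    // Splits text on Windows, Unix and old Mac line endings.
    // Empty entries are kept so that blank lines survive the split.
    public static string[] SplitLines(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        return text.Split(LineEndings, StringSplitOptions.None);
    }
}
```

Inside AddFormattedText, the call would then be var texts = NewlineSplitter.SplitLines(textToAdd); and the rest of the loop stays the same.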

Related

Prevent wrapped text overlap in iText7

I'm working on making a little program that just takes a .txt file and converts it into a PDF, and I want the PDF to look similar to how the text file would look if I used Microsoft's Print to PDF. I have it pretty close, but when a line of text exceeds the width of the page it wraps to a new line and the wrapped text overlaps the text above it. How do I get the wrapped text to behave as if I'm adding a new paragraph to the document without splitting the wrapped text into a new paragraph?
Here's my old code:
string dest = @"..\TXT2PDF\Test.pdf";
string source = @"..\TXT2PDF\test2.txt";
string fpath = @"..\TXT2PDF\consola.ttf";
string line;
FileInfo destFile = new FileInfo(dest);
destFile.Directory.Create();
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdf = new PdfDocument(writer);
PageSize ps = new PageSize(612, 792);
pdf.SetDefaultPageSize(ps);
Document document = new Document(pdf);
PdfFont font = PdfFontFactory.CreateFont(fpath);
StreamReader file = new StreamReader(source);
Console.WriteLine("Beginning Conversion");
document.SetLeftMargin(54);
document.SetRightMargin(54);
document.SetTopMargin(72);
document.SetBottomMargin(72);
while ((line = file.ReadLine()) != null)
{
Paragraph p = new Paragraph();
p.SetFixedLeading(4.8f);
p.SetFont(font).SetFontSize(10.8f);
p.SetPaddingTop(4.8f);
p.Add("\u00A0");
p.Add(line);
document.Add(p);
}
document.Close();
file.Close();
Console.WriteLine("Conversion Finished");
Console.ReadLine();
Here's my new code:
string dest = @"..\TXT2PDF\Test.pdf";
string source = @"..\TXT2PDF\test2.txt";
string fpath = @"..\TXT2PDF\consola.ttf";
string line;
FileInfo destFile = new FileInfo(dest);
destFile.Directory.Create();
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdf = new PdfDocument(writer);
PageSize ps = new PageSize(612, 792);
pdf.SetDefaultPageSize(ps);
Document document = new Document(pdf);
PdfFont font = PdfFontFactory.CreateFont(fpath, "cp1250", true);
StreamReader file = new StreamReader(source);
Console.WriteLine("Beginning Conversion");
document.SetLeftMargin(54);
document.SetRightMargin(54);
document.SetTopMargin(68);
document.SetBottomMargin(72);
document.SetProperty(Property.LEADING, new Leading(Leading.MULTIPLIED, 1.018f));
Paragraph p = new Paragraph();
p.SetFont(font).SetFontSize(10.8f);
p.SetCharacterSpacing(0.065f);
string nl = "";
while ((line = file.ReadLine()) != null)
{
Text t = new Text(nl + "\u0000" + line);
p.Add(t);
nl = "\n";
}
document.Add(p);
document.Close();
file.Close();
Console.WriteLine("Conversion Finished");
Console.ReadLine();
Here's an example of what the output looks like:
Edit
With mkl's recommendation I replaced p.SetFixedLeading(4.8f) with document.SetProperty(Property.LEADING, new Leading(Leading.MULTIPLIED, 1.018f)). That fixed the spacing issue for the wrapped text, but it caused the space between the paragraphs to increase more than I wanted. In order to get around that, I decided to only use one paragraph object and add each line as a new text object to the paragraph. I had tried that once before, but the text objects weren't going on new lines. I had to add the new line character to the beginning of each text object in order for them to be on their own line.
This is how the output looks now:
Rather than making each line from the txt file be a new paragraph, I changed each line to be a new text object, then I added each text object to a single paragraph. I then stopped using fixed leading on the paragraph and started using multiplied leading on the document itself using document.SetProperty(Property.LEADING, new Leading(Leading.MULTIPLIED, 1.018f)). I could've also used p.SetMultipliedLeading(1.018f) though. This achieved my desired spacing in the document.
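For anyone wondering why the fixed leading overlapped in the first place, comparing the numbers may help. This is only a rough sketch under a simplifying assumption: it treats multiplied leading as the factor times the font size, whereas the real layout engine works from the font's line height. LeadingMath and its methods are hypothetical names, not iText API.

```csharp
using System;

public static class LeadingMath
{
    // Simplified model (an assumption): multiplied leading = factor * font size.
    public static float MultipliedLeading(float fontSize, float factor)
    {
        return fontSize * factor;
    }

    // Wrapped lines collide when the distance between baselines is smaller
    // than the font size.
    public static bool LinesOverlap(float leading, float fontSize)
    {
        return leading < fontSize;
    }
}
```

With the values from the question, LinesOverlap(4.8f, 10.8f) is true for the old fixed leading, while LinesOverlap(MultipliedLeading(10.8f, 1.018f), 10.8f) is false for the new setting.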

Append characters to new page in `docx` document

I am trying to append a string to the end of a docx document, on a new page.
Here is the code I use now:
// path and fname hold the folder and file name of the document.
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(Path.Combine(path, fname), true))
{
    var body = wordDoc.MainDocumentPart.Document.Body;
    var para = body.AppendChild(new Paragraph());
    var run = para.AppendChild(new Run());
    var txt = "Document Signed by User" + Environment.NewLine;
    run.AppendChild(new Text(txt));
}
But it appends the text at the end of the document and not on a new page.
Edit
Used the solution proposed by Daniel A. White:
var para = body.AppendChild(new Paragraph());
var run = para.AppendChild(new Run());
var plc = run.AppendChild(new Break() { Type = BreakValues.Page });
var txt = "Document Signed by User: " + user.User + Environment.NewLine;
plc.AppendChild(new Text(txt));
But I got this error :
Non-composite elements do not have child elements
Insert a Break into your Run with Type set to BreakValues.Page.
run.AppendChild(new Break() { Type = BreakValues.Page });
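The error in the question's edit comes from appending the Text to the Break: a Break cannot hold child elements, so both the Break and the Text have to be children of the Run. Here is a minimal sketch of that arrangement, assuming the DocumentFormat.OpenXml package is referenced. PageBreakExample and BuildSignatureParagraph are hypothetical names used only for illustration.

```csharp
using DocumentFormat.OpenXml.Wordprocessing;

public static class PageBreakExample
{
    // Builds a paragraph whose run starts with a page break and then
    // carries the text, so the text lands at the top of the new page.
    public static Paragraph BuildSignatureParagraph(string signature)
    {
        var para = new Paragraph();
        var run = para.AppendChild(new Run());
        run.AppendChild(new Break { Type = BreakValues.Page }); // sibling of the text, not its parent
        run.AppendChild(new Text(signature));
        return para;
    }
}
```

The paragraph can then be added with body.AppendChild(PageBreakExample.BuildSignatureParagraph(txt)).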

C# OpenXML How to Replace \r\n with Break()?

I have a text field in my database and it has a text with many lines.
When I generate an MS Word document using OpenXML and bookmarks, the text becomes one single line.
I've noticed that at each new line the bookmark value shows the characters "\r\n".
Looking for a solution, I've found some answers that helped me, but I still have a problem.
I've used the run.Append(new Break()); solution, but the replaced text shows the name of the bookmark as well.
For example:
bookmark test = "Big text here in first paragraph\r\nSecond paragraph".
It is shown in MS Word document like:
testBig text here in first paragraph
Second paragraph
Can anyone, please, help me to eliminate the bookmark name?
Here is my code:
public void UpdateBookmarksVistoria(string originalPath, string copyPath, string fileType)
{
string wordmlNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
// Make a copy of the template file.
File.Copy(originalPath, copyPath, true);
//Open the document as an Open XML package and extract the main document part.
using (WordprocessingDocument wordPackage = WordprocessingDocument.Open(copyPath, true))
{
MainDocumentPart part = wordPackage.MainDocumentPart;
//Setup the namespace manager so you can perform XPath queries
//to search for bookmarks in the part.
NameTable nt = new NameTable();
XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
nsManager.AddNamespace("w", wordmlNamespace);
//Load the part's XML into an XmlDocument instance.
XmlDocument xmlDoc = new XmlDocument(nt);
xmlDoc.Load(part.GetStream());
//Get the URL used to display the photos
string url = HttpContext.Current.Request.Url.ToString();
string enderecoURL;
if (url.Contains("localhost"))
enderecoURL = url.Substring(0, 26);
else if (url.Contains("www."))
enderecoURL = url.Substring(0, 24);
else
enderecoURL = url.Substring(0, 20);
//Iterate through the bookmarks.
int cont = 56;
foreach (KeyValuePair<string, string> bookmark in bookmarks)
{
var res = from bm in part.Document.Body.Descendants<BookmarkStart>()
where bm.Name == bookmark.Key
select bm;
var bk = res.SingleOrDefault();
if (bk != null)
{
Run bookmarkText = bk.NextSibling<Run>();
if (bookmarkText != null) // if the bookmark has text replace it
{
var texts = bookmark.Value.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
for (int i = 0; i < texts.Length; i++)
{
if (i > 0)
bookmarkText.Append(new Break());
Text text = new Text();
text.Text = texts[i];
bookmarkText.Append(text); //HERE IS MY PROBLEM
}
}
else // otherwise append new text immediately after it
{
var parent = bk.Parent; // bookmark's parent element
Text text = new Text(bookmark.Value);
Run run = new Run(new RunProperties());
run.Append(text);
// insert after bookmark parent
parent.Append(run);
}
bk.Remove(); // we don't want the bookmark anymore
}
}
//Write the changes back to the document part.
xmlDoc.Save(wordPackage.MainDocumentPart.GetStream(FileMode.Create));
wordPackage.Close();
}
}
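A likely cause, judging from the code above: the Run that follows the BookmarkStart already contains a Text element holding the placeholder (here "test"), and the loop appends new Text elements after it instead of replacing it. Below is a minimal sketch of one way around that, assuming the placeholder really lives in that Run and that the DocumentFormat.OpenXml package is referenced. ReplaceTextWithLines is a hypothetical helper name, not an SDK method.

```csharp
using System;
using DocumentFormat.OpenXml.Wordprocessing;

public static class BookmarkRunExtensions
{
    // Clears the run's existing text and breaks, then writes the new value
    // with a Break between lines. Run properties (font, size) are left alone.
    public static void ReplaceTextWithLines(this Run run, string value)
    {
        run.RemoveAllChildren<Text>();
        run.RemoveAllChildren<Break>();

        // Accept "\r\n", "\n" and "\r" so the split does not depend on the platform.
        var lines = value.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None);
        for (int i = 0; i < lines.Length; i++)
        {
            if (i > 0)
                run.Append(new Break());
            run.Append(new Text(lines[i]));
        }
    }
}
```

In the loop from the question, the body of the if (bookmarkText != null) branch would then shrink to bookmarkText.ReplaceTextWithLines(bookmark.Value);.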

WPF print strings as lines

In my app I have to print some data to a printer. This data is stored in a collection; each record in the collection has a string field, and that is what is to be printed. Each of these string fields should be one line. I figured I would do something like this:
FlowDocument doc = new FlowDocument();
foreach (var x in myCollection)
{
Paragraph p = new Paragraph(new Run(x.PrintString));
doc.Blocks.Add(p);
}
doc.Name = "FlowDoc";
IDocumentPaginatorSource idpSource = doc;
printDlg.PrintDocument(idpSource.DocumentPaginator, "My Printing");
The problem is that there is an empty space after every line, something like this;
Line 1

Line 2

Line 3
When I need it to look like this;
Line 1
Line 2
Line 3
Thanks
edit: Add the definition of doc
Based on the comments, I was able to get it to work with the following:
FlowDocument doc = new FlowDocument();
Paragraph p = new Paragraph();
p.Margin = new Thickness(0); // set once; no need to repeat it for every line
foreach (var x in myCollection)
{
p.Inlines.Add(x.PrintString + "\r\n");
}
doc.Blocks.Add(p);
doc.Name = "FlowDoc";
IDocumentPaginatorSource idpSource = doc;
printDlg.PrintDocument(idpSource.DocumentPaginator, "My Printing");

Extract text by line from PDF using iTextSharp c#

I need to run some analysis by extracting data from a PDF document.
Using iTextSharp, I used the PdfTextExtractor.GetTextFromPage method to extract the contents of a PDF document, and it returned everything as a single long line.
Is there a way to get the text line by line so that I can store the lines in an array? Analyzing the data by line would be more flexible.
Below is the code I used:
string urlFileName1 = "pdf_link";
PdfReader reader = new PdfReader(urlFileName1);
string text = string.Empty;
for (int page = 1; page <= reader.NumberOfPages; page++)
{
text += PdfTextExtractor.GetTextFromPage(reader, page);
}
reader.Close();
candidate3.Text = text.ToString();
public void ExtractTextFromPdf(string path)
{
using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder();
ITextExtractionStrategy Strategy = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
for (int i = 1; i <= reader.NumberOfPages; i++)
{
string page = "";
page = PdfTextExtractor.GetTextFromPage(reader, i,Strategy);
string[] lines = page.Split('\n');
foreach (string line in lines)
{
MessageBox.Show(line);
}
}
}
}
I know this is posting on an older post, but I spent a lot of time trying to figure this out so I'm going to share this for the future people trying to google this:
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PDFApp2
{
class Program
{
static void Main(string[] args)
{
string filePath = @"Your said path\the file name.pdf";
string outPath = @"the output said path\the text file name.txt";
int pagesToScan = 2;
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filePath);
for (int page = 1; page <= pagesToScan; page ++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in A PDF
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
//creating the string array and storing the PDF line by line
string[] lines = strText.Split('\n');
//Creating and appending to a text file (opened once per page rather than once per line)
using (System.IO.StreamWriter file = new System.IO.StreamWriter(outPath, true))
{
foreach (string line in lines)
{
file.WriteLine(line);
}
}
}
reader.Close();
}
catch (Exception ex)
{
Console.Write(ex);
}
}
}
}
I had the program read in a PDF, from a set path, and just output to a text file, but you can manipulate that to anything. This was building off of Snziv Gupta's response.
All the other code samples here didn't work for me, probably due to changes to the itext7 API.
This minimal example here works ok:
var pdfReader = new iText.Kernel.Pdf.PdfReader(fileName);
var pdfDocument = new iText.Kernel.Pdf.PdfDocument(pdfReader);
var contents = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(pdfDocument.GetFirstPage());
LocationTextExtractionStrategy will automatically insert '\n' in the output text. However, sometimes it will insert '\n' where it shouldn't.
In that case you need to build a custom TextExtractionStrategy or RenderListener. Basically, the code that detects a new line is this method:
public virtual bool SameLine(ITextChunkLocation other) {
return OrientationMagnitude == other.OrientationMagnitude &&
DistPerpendicular == other.DistPerpendicular;
}
In some cases '\n' shouldn't be inserted if there is only a small difference between DistPerpendicular and other.DistPerpendicular, so you need to change the comparison to something like Math.Abs(DistPerpendicular - other.DistPerpendicular) < 10.
Or you can put that piece of code in the RenderText method of your custom TextExtractionStrategy/RenderListener class.
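The tolerance idea can be tried out in isolation before touching a real strategy class. The sketch below uses a stand-in type: ChunkLocation is hypothetical and only borrows the two property names from the method above (both assumed to be integers), and the tolerance of 10 is the example value from this answer, not a recommended setting.

```csharp
using System;

// Hypothetical stand-in for a text chunk location; not an iTextSharp type.
public sealed class ChunkLocation
{
    public int OrientationMagnitude { get; }
    public int DistPerpendicular { get; }

    public ChunkLocation(int orientationMagnitude, int distPerpendicular)
    {
        OrientationMagnitude = orientationMagnitude;
        DistPerpendicular = distPerpendicular;
    }

    // Two chunks count as the same line when they share an orientation and
    // their perpendicular distances differ by less than the tolerance.
    public bool SameLine(ChunkLocation other, int tolerance)
    {
        return OrientationMagnitude == other.OrientationMagnitude
            && Math.Abs(DistPerpendicular - other.DistPerpendicular) < tolerance;
    }
}
```

For example, new ChunkLocation(0, 100).SameLine(new ChunkLocation(0, 104), 10) treats the two chunks as one line, whereas the exact comparison would have inserted a '\n' between them.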
Use LocationTextExtractionStrategy in lieu of SimpleTextExtractionStrategy. Text extracted with LocationTextExtractionStrategy contains a newline character at the end of each line.
ITextExtractionStrategy Strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
string pdftext = PdfTextExtractor.GetTextFromPage(reader,pageno, Strategy);
string[] words = pdftext.Split('\n');
return words;
With the Java version of iText, try:
String page = PdfTextExtractor.getTextFromPage(reader, 2);
String s1[]=page.split("\n");
