I am trying to convert a standard A4 PDF file into a .txt file using the Spire.Pdf NuGet package. Whenever I do, there is a lot of whitespace at the start of each line, presumably where the document's margins are. I worked around the issue with the TrimStart() method, but I would like to remove the margins using Spire.Pdf itself.
I have played around with setting a PdfTextExtractOptions ExtractArea RectangleF, but for some reason it cuts off the bottom of the text and I lose rows.
My code is:
PdfDocument doc = new PdfDocument();
doc.LoadFromFile(@"path");
var content = new List<string>();
RectangleF rectangle = new RectangleF(45, 0, 0, 0);
PdfTextExtractOptions options = new() { IsExtractAllText = true, IsShowHiddenText = true, ExtractArea = rectangle };
foreach (PdfPageBase page in doc.Pages)
{
PdfTextExtractor textExtractor = new(page);
//extract text from a specific rectangular area here - default A4 margin sizes?
string extractedText = textExtractor.ExtractText(options);
content.Add(extractedText);
}
FileStream fs = new FileStream(@"outputFile.txt", FileMode.Create);
StreamWriter sw = new StreamWriter(fs);
string txtBefore = (string.Join("\n", content));
sw.Write(txtBefore);
Thanks in advance
You can try the code below to extract text from the PDF. It does not generate extra whitespace at the start of each line in the resulting .txt file; I have tested it.
PdfDocument doc = new PdfDocument();
doc.LoadFromFile(@"test.pdf");
PdfTextExtractOptions options = new PdfTextExtractOptions();
options.IsSimpleExtraction = true;
StringBuilder sb = new StringBuilder();
foreach (PdfPageBase page in doc.Pages)
{
PdfTextExtractor extractor = new PdfTextExtractor(page);
sb.AppendLine(extractor.ExtractText(options));
}
File.WriteAllText("Extract.txt", sb.ToString());
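If some margin padding still remains after extraction, a library-independent fallback is to strip the indentation shared by every line. Unlike calling TrimStart() on each line, this keeps the relative indentation between lines. A minimal sketch in plain .NET; TextDedent and RemoveCommonIndent are hypothetical names and are not part of Spire.PDF:

```csharp
using System;
using System.Linq;

public static class TextDedent
{
    // Hypothetical helper: removes the smallest leading-space count
    // shared by all non-blank lines, so relative indentation survives.
    public static string RemoveCommonIndent(string text)
    {
        string[] lines = text.Replace("\r\n", "\n").Split('\n');

        // Smallest number of leading spaces among lines that contain text.
        int indent = lines
            .Where(l => l.Trim().Length > 0)
            .Select(l => l.Length - l.TrimStart(' ').Length)
            .DefaultIfEmpty(0)
            .Min();

        // Blank lines may be shorter than the common indent; just trim those.
        return string.Join("\n", lines.Select(
            l => l.Length >= indent ? l.Substring(indent) : l.TrimStart(' ')));
    }
}
```

It could be applied to the extracted string before writing the file, for example File.WriteAllText("Extract.txt", TextDedent.RemoveCommonIndent(sb.ToString())).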
I have a text field in my database that holds text with many lines.
When I generate an MS Word document using OpenXML and bookmarks, the text becomes one single line.
I've noticed that at each line break the bookmark value shows the characters "\r\n".
Looking for a solution, I found some answers that helped, but I still have a problem.
I used the run.Append(new Break()); solution, but the replaced text shows the name of the bookmark as well.
For example:
bookmark test = "Big text here in first paragraph\r\nSecond paragraph".
It is shown in MS Word document like:
testBig text here in first paragraph
Second paragraph
Can anyone please help me eliminate the bookmark name?
Here is my code:
public void UpdateBookmarksVistoria(string originalPath, string copyPath, string fileType)
{
string wordmlNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
// Make a copy of the template file.
File.Copy(originalPath, copyPath, true);
//Open the document as an Open XML package and extract the main document part.
using (WordprocessingDocument wordPackage = WordprocessingDocument.Open(copyPath, true))
{
MainDocumentPart part = wordPackage.MainDocumentPart;
//Setup the namespace manager so you can perform XPath queries
//to search for bookmarks in the part.
NameTable nt = new NameTable();
XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
nsManager.AddNamespace("w", wordmlNamespace);
//Load the part's XML into an XmlDocument instance.
XmlDocument xmlDoc = new XmlDocument(nt);
xmlDoc.Load(part.GetStream());
//Get the URL used to display the photos
string url = HttpContext.Current.Request.Url.ToString();
string enderecoURL;
if (url.Contains("localhost"))
enderecoURL = url.Substring(0, 26);
else if (url.Contains("www."))
enderecoURL = url.Substring(0, 24);
else
enderecoURL = url.Substring(0, 20);
//Iterate through the bookmarks.
int cont = 56;
foreach (KeyValuePair<string, string> bookmark in bookmarks)
{
var res = from bm in part.Document.Body.Descendants<BookmarkStart>()
where bm.Name == bookmark.Key
select bm;
var bk = res.SingleOrDefault();
if (bk != null)
{
Run bookmarkText = bk.NextSibling<Run>();
if (bookmarkText != null) // if the bookmark has text replace it
{
var texts = bookmark.Value.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
for (int i = 0; i < texts.Length; i++)
{
if (i > 0)
bookmarkText.Append(new Break());
Text text = new Text();
text.Text = texts[i];
bookmarkText.Append(text); //HERE IS MY PROBLEM
}
}
else // otherwise append new text immediately after it
{
var parent = bk.Parent; // bookmark's parent element
Text text = new Text(bookmark.Value);
Run run = new Run(new RunProperties());
run.Append(text);
// insert after bookmark parent
parent.Append(run);
}
bk.Remove(); // we don't want the bookmark anymore
}
}
//Write the changes back to the document part.
xmlDoc.Save(wordPackage.MainDocumentPart.GetStream(FileMode.Create));
wordPackage.Close();
}}
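One possible explanation, assuming the run that follows the bookmark in the template still holds placeholder text (here, the bookmark name): the loop only appends new Text elements to that run, so the old text stays in front of them. A minimal sketch that clears the run first; BookmarkRunHelper and ReplaceRunText are hypothetical names, and the cause itself is an assumption about the template:

```csharp
using System;
using DocumentFormat.OpenXml.Wordprocessing;

public static class BookmarkRunHelper
{
    // Hypothetical helper: replaces the visible text of a run instead of
    // appending to whatever the template already stored in it.
    public static void ReplaceRunText(Run run, string value)
    {
        // Drop the placeholder text (and any old line breaks) first.
        run.RemoveAllChildren<Text>();
        run.RemoveAllChildren<Break>();

        string[] lines = value.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
        for (int i = 0; i < lines.Length; i++)
        {
            if (i > 0)
                run.Append(new Break()); // line break between consecutive lines
            run.Append(new Text(lines[i]));
        }
    }
}
```

In the method above, the body of the if (bookmarkText != null) branch could then become a single call: BookmarkRunHelper.ReplaceRunText(bookmarkText, bookmark.Value).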
I am following this structure to add text from strings into OpenXML Runs, which are part of a Word document.
The string has new-line formatting and even paragraph indentation, but these all get stripped away when the text is inserted into a run. How can I preserve them?
Body body = wordprocessingDocument.MainDocumentPart.Document.Body;
String txt = "Some formatted string! \r\nLook there should be a new line here!\r\n\r\nAndthere should be 2 new lines here!";
// Add new text.
Paragraph para = body.AppendChild(new Paragraph());
Run run = para.AppendChild(new Run());
run.AppendChild(new Text(txt));
You need to use a Break in order to add new lines, otherwise they will just be ignored.
I've knocked together a simple extension method that will split a string on a new line and append Text elements to a Run with Breaks where the new lines were:
public static class OpenXmlExtension
{
public static void AddFormattedText(this Run run, string textToAdd)
{
var texts = textToAdd.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
for (int i = 0; i < texts.Length; i++)
{
if (i > 0)
run.Append(new Break());
Text text = new Text();
text.Text = texts[i];
run.Append(text);
}
}
}
This can be used like this:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(@"c:\somepath\test.docx", true))
{
var body = wordDoc.MainDocumentPart.Document.Body;
String txt = "Some formatted string! \r\nLook there should be a new line here!\r\n\r\nAndthere should be 2 new lines here!";
// Add new text.
Paragraph para = body.AppendChild(new Paragraph());
Run run = para.AppendChild(new Run());
run.AddFormattedText(txt);
}
This produces a paragraph in which each "\r\n" in the string starts a new line.
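Two caveats with the extension method above: Environment.NewLine is "\r\n" on Windows but "\n" on Unix-like systems, so a string with the other style of line ending would not be split, and Word drops leading or trailing spaces in a Text element unless xml:space="preserve" is set. A hedged variant that handles both; AddMultilineText is a hypothetical name chosen for illustration:

```csharp
using System;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class RunTextExtensions
{
    // Hypothetical variant of AddFormattedText.
    public static void AddMultilineText(this Run run, string textToAdd)
    {
        // Split on Windows and Unix line endings alike.
        string[] lines = textToAdd.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
        for (int i = 0; i < lines.Length; i++)
        {
            if (i > 0)
                run.Append(new Break());

            // xml:space="preserve" keeps leading and trailing spaces,
            // which is what simple indentation relies on.
            run.Append(new Text(lines[i]) { Space = SpaceProcessingModeValues.Preserve });
        }
    }
}
```

It is called the same way as the original: run.AddMultilineText(txt).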
I want to create a PDF file with Arabic text content in C#, and I'm using iTextSharp to create it. I followed the instructions in http://geekswithblogs.net/JaydPage/archive/2011/11/02/using-itextsharp-to-correctly-display-hebrew--arabic-text-right.aspx. I want to insert the following Arabic sentence into the PDF.
تم إبرام هذا العقد في هذا اليوم [●] م الموافق [●] من قبل وبين .
The [●] placeholders need to be replaced by dynamic English words. I tried to implement this using ARIALUNI.TTF (the tutorial linked above suggested it). The code is given below.
public void WriteDocument()
{
//Declare a itextSharp document
Document document = new Document(PageSize.A4);
//Create our file stream and bind the writer to the document and the stream
PdfWriter writer = PdfWriter.GetInstance(document, new FileStream(@"D:\Test.Pdf", FileMode.Create));
//Open the document for writing
document.Open();
//Add a new page
document.NewPage();
//Reference a Unicode font to be sure that the symbols are present.
BaseFont bfArialUniCode = BaseFont.CreateFont(@"D:\ARIALUNI.TTF", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
//Create a font from the base font
Font font = new Font(bfArialUniCode, 12);
//Use a table so that we can set the text direction
PdfPTable table = new PdfPTable(1);
//Ensure that wrapping is on, otherwise Right to Left text will not display
table.DefaultCell.NoWrap = false;
//Create a regex expression to detect hebrew or arabic code points
const string regex_match_arabic_hebrew = @"[\u0600-\u06FF,\u0590-\u05FF]+";
if (Regex.IsMatch("م الموافق", regex_match_arabic_hebrew, RegexOptions.IgnoreCase))
{
table.RunDirection = PdfWriter.RUN_DIRECTION_RTL;
}
//Create a cell and add text to it
PdfPCell text = new PdfPCell(new Phrase(" : "+"من قبل وبين" + " 2007 " + "م الموافق" + " dsdsdsdsds " + "تم إبرام هذا العقد في هذا اليوم ", font));
//Ensure that wrapping is on, otherwise Right to Left text will not display
text.NoWrap = false;
//Add the cell to the table
table.AddCell(text);
//Add the table to the document
document.Add(table);
//Close the document
document.Close();
//Launch the document if you have a file association set for PDF's
Process AcrobatReader = new Process();
AcrobatReader.StartInfo.FileName = @"D:\Test.Pdf";
AcrobatReader.Start();
}
While calling this function, I got a PDF with some Unicode as given below.
اذه يف دقعلا اذه ماربإ مت dsdsdsdsds قفاوملا م 2007 نيبو لبق نم
مويلا
It does not match our hard-coded Arabic sentence. Is this an issue with the font? Please help me, or suggest another way to implement this.
@csharpcoder has the right idea, but his execution is off. He doesn't add the cell to a table, and the table doesn't end up in the document.
void Go()
{
Document doc = new Document(PageSize.LETTER);
string yourPath = "foo/bar/baz.pdf";
using (FileStream os = new FileStream(yourPath, FileMode.Create))
{
PdfWriter.GetInstance(doc, os); // you don't need the return value
doc.Open();
string fontLoc = @"c:\windows\fonts\arialuni.ttf"; // make sure to have the correct path to the font file
BaseFont bf = BaseFont.CreateFont(fontLoc, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
Font f = new Font(bf, 12);
PdfPTable table = new PdfPTable(1); // a table with 1 column
Phrase text = new Phrase("العقد", f);
PdfPCell cell = new PdfPCell(text);
table.RunDirection = PdfWriter.RUN_DIRECTION_RTL; // can also be set on the cell
table.AddCell(cell);
doc.Add(table);
doc.Close();
}
}
You will probably want to get rid of the cell borders etc, but that information can be found elsewhere on SO or the iText website. iText should be able to handle text that contains both RTL and LTR characters.
EDIT
I think the source problem is actually with how the Arabic text is rendered in Visual Studio and in Firefox (my browser), or alternatively with how the Strings are concatenated. I'm not very familiar with Arabic text editors, but the text seems to come out correctly if we do this:
FYI I had to take a screenshot, because copy-pasting into the browser from VS (and vice versa) messes up the order of the parts of the text.
Right-to-left writing and Arabic ligatures are only supported in ColumnText and PdfPTable!
Try the code below:
Document Doc = new Document(PageSize.LETTER);
//Create our file stream
using (FileStream fs = new FileStream(Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Test.pdf"), FileMode.Create, FileAccess.Write, FileShare.Read))
{
//Bind PDF writer to document and stream
PdfWriter writer = PdfWriter.GetInstance(Doc, fs);
//Open document for writing
Doc.Open();
//Add a page
Doc.NewPage();
//Full path to the font file (arabtype.TTF here, despite the variable name)
string ARIALUNI_TFF = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), "arabtype.TTF");
//Create a base font object making sure to specify IDENTITY-H
BaseFont bf = BaseFont.CreateFont(ARIALUNI_TFF, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
Font f = new Font(bf, 12);
//Write some text, the last character is 0x0278 - LATIN SMALL LETTER PHI
Doc.Add(new Phrase("This is a ميسو ɸ", f));
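//add Arabic text, for instance in a table
PdfPTable table = new PdfPTable(1);
PdfPCell cell = new PdfPCell();
cell.AddElement(new Phrase("Hello\u0682", f));
cell.RunDirection = PdfWriter.RUN_DIRECTION_RTL;
//The cell only shows up if it is added to a table that is added to the document
table.AddCell(cell);
Doc.Add(table);
//Close the PDF
Doc.Close();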
}
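The answer names ColumnText as the other place where right-to-left text is handled, but only shows the table route. A minimal ColumnText sketch, assuming iTextSharp 5.x; RtlColumnExample, its parameters, and the column coordinates (an A4 page with 36 pt margins) are illustrative assumptions:

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class RtlColumnExample
{
    // Hypothetical helper: writes one right-to-left phrase through ColumnText.
    public static void Write(string outputPath, string fontPath, string text)
    {
        Document doc = new Document(PageSize.A4);
        using (FileStream fs = new FileStream(outputPath, FileMode.Create))
        {
            PdfWriter writer = PdfWriter.GetInstance(doc, fs);
            doc.Open();

            // The font must contain the Arabic glyphs; IDENTITY_H keeps Unicode.
            BaseFont bf = BaseFont.CreateFont(fontPath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
            Font font = new Font(bf, 12);

            ColumnText column = new ColumnText(writer.DirectContent);
            // Lower-left and upper-right corners of the column, in points.
            column.SetSimpleColumn(36, 36, 559, 806);
            column.RunDirection = PdfWriter.RUN_DIRECTION_RTL;
            column.AddText(new Phrase(text, font));
            column.Go();

            doc.Close();
        }
    }
}
```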
I hope these notes, gathered from other answers, help you:
Build the font path safely instead of hard-coding it:
var tahomaFontFile = Path.Combine(
Environment.GetFolderPath(Environment.SpecialFolder.Fonts),
"Tahoma.ttf");
Pass BaseFont.IDENTITY_H and BaseFont.EMBEDDED when creating the base font.
var tahomaBaseFont = BaseFont.CreateFont(tahomaFontFile,
BaseFont.IDENTITY_H,
BaseFont.EMBEDDED);
var tahomaFont = new Font(tahomaBaseFont, 8, Font.NORMAL);
Use PdfWriter.RUN_DIRECTION_RTL, for both your cell and your table:
var table = new PdfPTable(1)
{
RunDirection = PdfWriter.RUN_DIRECTION_RTL
};
var phrase = new Phrase("تم إبرام هذا العقد في هذا اليوم [●] م الموافق [●] من قبل وبين .",
tahomaFont);
var cell = new PdfPCell(phrase)
{
RunDirection = PdfWriter.RUN_DIRECTION_RTL,
Border = 0,
};
I believe your problem is in how the string is built. Try the code below; it works fine for me. Good luck.
public static void GeneratePDF()
{
//Declare a itextSharp document
Document document = new Document(PageSize.A4);
Random ran = new Random();
string PDFFileName = string.Format(@"C:\Test{0}.Pdf", ran.Next());
//Create our file stream and bind the writer to the document and the stream
PdfWriter writer = PdfWriter.GetInstance(document, new FileStream(PDFFileName, FileMode.Create));
//Open the document for writing
document.Open();
//Add a new page
document.NewPage();
var ArialFontFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), "ARIALUNI.ttf");
//Reference a Unicode font to be sure that the symbols are present.
BaseFont bfArialUniCode = BaseFont.CreateFont(ArialFontFile, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
//Create a font from the base font
Font font = new Font(bfArialUniCode, 12);
//Use a table so that we can set the text direction
var table = new PdfPTable(1)
{
RunDirection = PdfWriter.RUN_DIRECTION_RTL,
};
//Ensure that wrapping is on, otherwise Right to Left text will not display
table.DefaultCell.NoWrap = false;
ContentObject CO = new ContentObject();
CO.Name = "Ahmed Gomaa";
CO.StartDate = DateTime.Now.AddMonths(-5);
CO.EndDate = DateTime.Now.AddMonths(43);
string content = string.Format(" تم إبرام هذا العقد في هذا اليوم من قبل {0} في تاريخ بين {1} و {2}", CO.Name, CO.StartDate, CO.EndDate);
var phrase = new Phrase(content, font);
//var phrase = new Phrase("الحمد لله رب العالمين", font);
//Create a cell and add text to it
PdfPCell text = new PdfPCell(phrase)
{
RunDirection = PdfWriter.RUN_DIRECTION_RTL,
Border = 0
};
//Ensure that wrapping is on, otherwise Right to Left text will not display
text.NoWrap = false;
//Add the cell to the table
table.AddCell(text);
//Add the table to the document
document.Add(table);
//Close the document
document.Close();
//Launch the document if you have a file association set for PDF's
Process AcrobatReader = new Process();
AcrobatReader.StartInfo.FileName = PDFFileName;
AcrobatReader.Start();
}
}
public class ContentObject
{
public string Name { set; get; }
public DateTime StartDate { set; get; }
public DateTime EndDate { set; get; }
}
I am trying to replace a specific word inside a PDF document using iTextSharp and C#. While debugging I get the proper values, but the output PDF is zero bytes (empty); it is not filled with the content.
ReplacePDFText("Mumbai",StringComparison.CurrentCultureIgnoreCase,Application.StartupPath + "\\test.pdf","D:\\test_words_replaced.pdf"); //Do Everything
public void ReplacePDFText(string strSearch, StringComparison scCase, string strSource, string strDest)
{
PdfStamper psStamp = null; //PDF Stamper Object
PdfContentByte pcbContent = null; //Read PDF Content
if (File.Exists(strSource)) //Check If File Exists
{
PdfReader pdfFileReader = new PdfReader(strSource); //Read Our File
psStamp = new PdfStamper(pdfFileReader, new FileStream(strDest, FileMode.Create)); //Read Underlying Content of PDF File
pbProgress.Value = 0; //Set Progressbar Minimum Value
pbProgress.Maximum = pdfFileReader.NumberOfPages; //Set Progressbar Maximum Value
for (int intCurrPage = 1; intCurrPage <= pdfFileReader.NumberOfPages; intCurrPage++) //Loop Through All Pages
{
LocTextExtractionStrategy lteStrategy = new LocTextExtractionStrategy(); //Read PDF File Content Blocks
pcbContent = psStamp.GetUnderContent(intCurrPage); //Look At Current Block
//Determine Spacing of Block To See If It Matches Our Search String
lteStrategy.UndercontentCharacterSpacing = pcbContent.CharacterSpacing;
lteStrategy.UndercontentHorizontalScaling = pcbContent.HorizontalScaling;
//Trigger The Block Reading Process
string currentText = PdfTextExtractor.GetTextFromPage(pdfFileReader, intCurrPage, lteStrategy);
//Determine Match(es)
List<iTextSharp.text.Rectangle> lstMatches = lteStrategy.GetTextLocations(strSearch, scCase);
PdfLayer pdLayer = default(PdfLayer); //Create New Layer
pdLayer = new PdfLayer("Overrite", psStamp.Writer); //Enable Overwriting Capabilities
//Set Fill Colour Of Replacing Layer
pcbContent.SetColorFill(BaseColor.BLACK);
foreach (iTextSharp.text.Rectangle rctRect in lstMatches) //Loop Through Each Match
{
pcbContent.Rectangle(rctRect.Left, rctRect.Bottom, rctRect.Width, rctRect.Height); //Create New Rectangle For Replacing Layer
pcbContent.Fill(); //Fill With Colour Specified
pcbContent.BeginLayer(pdLayer); //Create Layer
pcbContent.SetColorFill(BaseColor.BLACK); //Fill Layer
pcbContent.Fill(); //Fill Underlying Content
PdfGState pgState = default(PdfGState); //Create GState Object
pgState = new PdfGState();
pcbContent.SetGState(pgState); //Set Current State
pcbContent.SetColorFill(BaseColor.WHITE); //Fill Letters
pcbContent.BeginText(); //Start Text Replace Procedure
pcbContent.SetTextMatrix(rctRect.Left, rctRect.Bottom); //Get Text Location
//Set New Font And Size
pcbContent.SetFontAndSize(BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED), 9);
pcbContent.ShowText("AMAZING!!!!"); //Replacing Text
pcbContent.EndText(); //Stop Text Replace Procedure
pcbContent.EndLayer(); //Stop Layer replace procedure
}
pbProgress.Value++; //Increase Progressbar Value
pdfFileReader.Close(); //Close File
}
//psStamp.Close(); //Close Stamp Object
}
}
You call
pdfFileReader.Close();
much too early: inside a loop in which the next iteration still requires pdfFileReader and furthermore before closing the stamper.
The stamper requires the PdfReader to still be open when the stamper closes because it copies certain parts of the reader only then.
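The ordering described above can be sketched as follows: do all per-page work inside the loop, close the stamper once the loop has finished, and close the reader last. StampExample and StampAllPages are hypothetical names, and the drawing code is elided:

```csharp
using System.IO;
using iTextSharp.text.pdf;

public static class StampExample
{
    // Hypothetical helper showing only the open/close ordering.
    public static void StampAllPages(string source, string dest)
    {
        PdfReader reader = new PdfReader(source);
        using (FileStream output = new FileStream(dest, FileMode.Create))
        {
            PdfStamper stamper = new PdfStamper(reader, output);
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                PdfContentByte under = stamper.GetUnderContent(page);
                // ... draw rectangles and replacement text on 'under' here ...
            }

            // Closing the stamper writes the output; the reader must still be open.
            stamper.Close();
        }

        // Only now is it safe to close the reader.
        reader.Close();
    }
}
```

In the question's code this corresponds to moving pdfFileReader.Close() out of the for loop and restoring the commented-out psStamp.Close() before it.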