How to merge multiple RTF files into one RTF in C# - c#

I'm trying to merge multiple rtf documents into one. the size of the merged is increased (size of all the documents) but when i open, i could see only the first RTF file content only.
string srcpath = #"C:\CSI\RTFtest\src\";
string despath = #"C:\CSI\RTFtest\dest\single.rtf";
string content = "";
List<string> files = new List<string>(Directory.GetFiles(srcpath, "*.rtf"));
StreamReader read;
if (files.Count > 1)
{
for (int i = 0; i < files.Count; i++)
{
String filename = files[i];
content = File.ReadAllText(filename);
//content = content + read.ReadToEnd();
File.AppendAllText(despath, content.ToString());
File.AppendAllText(despath, System.Environment.NewLine);
}

RTF Files are not text files. You can't jsut concatenate the text as there are headers and other structures involved. You can read all about this spec here http://support.microsoft.com/kb/86999 (yuck).
You can use a TextRange object if you're using wpf, load the first file into it then append each additional file's content. Or you could read files into richtextbox objects in winforms and append the content (how to load: http://msdn.microsoft.com/en-us/library/1z7hy77a.aspx).
I guess you could use TextBox1.Rtf = TextBox1.Rtf + textBox2.rtf untill all are loaded since the rtf property is the string with rtf encoding in it.

Use StringBuilder. You can append RTF to RTF.
Example:
StringBuilder sb = new StringBuilder();
sb.Append(#"{\rtf1\ansi");
sb.Append(#"...
HPIrichTextBox.Rtf = sb.ToString();

Related

Convert bytes to PDF File

I am able to create a word doc using the code below.
Question: how do i create a pdf instead of word doc?
Code
using (StreamWriter outputFile = new StreamWriter(Path.Combine(docPath, tdindb.TDCode + "-test.doc")))
{
string html = string.Format("<html>{0}</html>", sbHtml);
outputFile.WriteLine(html);
}
string FileLocation = docPath + "\\" + tdindb.TDCode + "-test.doc";
byte[] fileBytes = System.IO.File.ReadAllBytes(FileLocation);
string fileName = Path.GetFileName(FileLocation);
return File(fileBytes, System.Net.Mime.MediaTypeNames.Application.Octet, fileName);
Thank you
You need to use one of PDF creating libraries. I tried to use iText , IronPDF, and PDFFlow. All of them create PDF documents from scratch.
But PDFFlow was better for my case because i needed automatic page creation and multi-page spread table)
This is how to create a simple PDF file in C#:
{
var DocumentBuilder.New()
.AddSection()
.AddParagraphToSection("your text goes here!")
.ToSection()
.ToDocument()
.Build("Result.PDF");
}
feel free to ask me if you need more help.
Converting a Word document to HTML has been answered here before. The linked example is the first result of many on this site.
Once you have your HTML, to create a PDF you need to use a PDF creation library. For this example we will use IronPDF which requires just 3 lines of code:
string html = string.Format("<html>{0}</html>", sbHtml);
var renderer = new IronPdf.ChromePdfRenderer();
// Save PDF file to BinaryData
renderer.RenderHtmlAsPdf(html).BinaryData;
// Save file to location
renderer.RenderHtmlAsPdf(html).SaveAs("output.pdf");

iTextSharp How to read Table in PDF file

I am working on convert PDF to text. I can get text from PDF correctly but it is being complicated in table structure. I know PDF doesn't support table structure but I think there is a way get cells correctly. Well, for example:
I want to convert to text like this:
> This is first example.
> This is second example.
But, when I convert PDF to text, theese datas looking like this:
> This is This is
> first example. second example.
How can I get values correctly?
--EDIT:
Here is how did I convert PDF to Text:
OpenFileDialog ofd = new OpenFileDialog();
string filepath;
ofd.Filter = "PDF Files(*.PDF)|*.PDF|All Files(*.*)|*.*";
if (ofd.ShowDialog() == DialogResult.OK)
{
filepath = ofd.FileName.ToString();
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filepath);
for (int page = 1; page < reader.NumberOfPages; page++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
strText += s;
}
reader.Close();
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
}
To make my comment an actual answer...
You use the LocationTextExtractionStrategy for text extraction:
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
This strategy arranges all text it finds in left-to-right lines from top to bottom (actually also taking the text line angle into account). Thus, it clearly is not what you need to extract text from tables with cells with multi-line content.
Depending on the document in question there are different approaches one can take:
Use the iText SimpleTextExtractionStrategy if the text drawing operations in the document in question already are in the order one wants for text extraction.
Use a custom text extraction strategy which makes use of tagging information if the document tables are properly tagged.
Use a complex custom text extraction strategy which tries to get hints from text arrangements, line paths, or background colors to guess the table cell structure and extract text cell by cell.
In this case, the OP commented that he changed LocationTextExtractionStrategy with SimpleTextExtractionStrategy, then it worked.

Convert a Word (DOCX) file to a PDF in C# on cloud environment

I have generated a word file using Open Xml and I need to send it as attachment in a email with pdf format but I cannot save any physical pdf or word file on disk because I develop my application in cloud environment(CRM online).
I found only way is "Aspose Word to .Net".
http://www.aspose.com/docs/display/wordsnet/How+to++Convert+a+Document+to+a+Byte+Array But it is too expensive.
Then I found a solution is to convert word to html, then convert html to pdf. But there is a picture in my word. And I cannot resolve the issue.
The most accurate conversion from DOCX to PDF is going to be through Word. Your best option for that is setting up a server with OWAS (Office Web Apps Server) and doing your conversion through that.
You'll need to set up a WOPI endpoint on your application server and call:
/wv/WordViewer/request.pdf?WOPISrc={WopiUrl}&type=downloadpdf
OR
/wv/WordViewer/request.pdf?WOPISrc={WopiUrl}&type=printpdf
Alternatively you could try and do it using OneDrive and Word Online, but you'll need to work out the parameters Word Online uses as well as whether that's permitted within the Ts & Cs.
You can try Gnostice XtremeDocumentStudio .NET.
Converting From DOCX To PDF Using XtremeDocumentStudio .NET
http://www.gnostice.com/goto.asp?id=24900&t=convert_docx_to_pdf_using_xdoc.net
In the published article, conversion has been demonstrated to save to a physical file. You can use documentConverter.ConvertToStream method to convert a document to a Stream as shown below in the code snippet.
DocumentConverter documentConverter = new DocumentConverter();
// input can be a FilePath, Stream, list of FilePaths or list of Streams
Object input = "InputDocument.docx";
string outputFileFormat = "pdf";
ConversionMode conversionMode = ConversionMode.ConvertToSeperateFiles;
List<Stream> outputStreams = documentConverter.ConvertToStream(input, outputFileFormat, conversionMode);
Disclaimer: I work for Gnostice.
If you wanna convert bytes array, then to use Metamorphosis:
string docxPath = #"example.docx";
string pdfPath = Path.ChangeExtension(docxPath, ".pdf");
byte[] docx = File.ReadAllBytes(docxPath);
// Convert DOCX to PDF in memory
byte[] pdf = p.DocxToPdfConvertByte(docx);
if (pdf != null)
{
// Save the PDF document to a file for a viewing purpose.
File.WriteAllBytes(pdfPath, pdf);
System.Diagnostics.Process.Start(pdfPath);
}
else
{
System.Console.WriteLine("Conversion failed!");
Console.ReadLine();
}
I have recently used SautinSoft 'Document .Net' library to convert docx to pdf in my React(frontend), .NET core(micro services- backend) application. It only take 15 seconds to generate a pdf having 23 pages. This 15 seconds includes getting data from database, then merging data with docx template and then converting it to pdf. The code has deployed to azure Linux box and works fine.
https://sautinsoft.com/products/document/
Sample code
public string GeneratePDF(PDFDocumentModel document)
{
byte[] output = null;
using (var outputStream = new MemoryStream())
{
// Create single pdf.
DocumentCore singlePDF = new DocumentCore();
var documentCores = new List<DocumentCore>();
foreach (var section in document.Sections)
{
documentCores.Add(GenerateDocument(section));
}
foreach (var dc in documentCores)
{
// Create import session.
ImportSession session = new ImportSession(dc, singlePDF, StyleImportingMode.KeepSourceFormatting);
// Loop through all sections in the source document.
foreach (Section sourceSection in dc.Sections)
{
// Because we are copying a section from one document to another,
// it is required to import the Section into the destination document.
// This adjusts any document-specific references to styles, bookmarks, etc.
// Importing a element creates a copy of the original element, but the copy
// is ready to be inserted into the destination document.
Section importedSection = singlePDF.Import<Section>(sourceSection, true, session);
// First section start from new page.
if (dc.Sections.IndexOf(sourceSection) == 0)
importedSection.PageSetup.SectionStart = SectionStart.NewPage;
// Now the new section can be appended to the destination document.
singlePDF.Sections.Add(importedSection);
//Paging
HeaderFooter footer = new HeaderFooter(singlePDF, HeaderFooterType.FooterDefault);
// Create a new paragraph to insert a page numbering.
// So that, our page numbering looks as: Page N of M.
Paragraph par = new Paragraph(singlePDF);
par.ParagraphFormat.Alignment = HorizontalAlignment.Center;
CharacterFormat cf = new CharacterFormat() { FontName = "Consolas", Size = 11.0 };
par.Content.Start.Insert("Page ", cf.Clone());
// Page numbering is a Field.
Field fPage = new Field(singlePDF, FieldType.Page);
fPage.CharacterFormat = cf.Clone();
par.Content.End.Insert(fPage.Content);
par.Content.End.Insert(" of ", cf.Clone());
Field fPages = new Field(singlePDF, FieldType.NumPages);
fPages.CharacterFormat = cf.Clone();
par.Content.End.Insert(fPages.Content);
footer.Blocks.Add(par);
importedSection.HeadersFooters.Add(footer);
}
}
var pdfOptions = new PdfSaveOptions();
pdfOptions.Compression = false;
pdfOptions.EmbedAllFonts = false;
pdfOptions.EmbeddedImagesFormat = PdfSaveOptions.EmbImagesFormat.Png;
pdfOptions.EmbeddedJpegQuality = 100;
//dont allow editing after population, also ensures content can be printed.
pdfOptions.PreserveFormFields = false;
pdfOptions.PreserveContentControls = false;
if (!string.IsNullOrEmpty(document.PdfProperties.Title))
{
singlePDF.Document.Properties.BuiltIn[BuiltInDocumentProperty.Title] = document.PdfProperties.Title;
}
if (!string.IsNullOrEmpty(document.PdfProperties.Author))
{
singlePDF.Document.Properties.BuiltIn[BuiltInDocumentProperty.Author] = document.PdfProperties.Author;
}
if (!string.IsNullOrEmpty(document.PdfProperties.Subject))
{
singlePDF.Document.Properties.BuiltIn[BuiltInDocumentProperty.Subject] = document.PdfProperties.Subject;
}
singlePDF.Save(outputStream, pdfOptions);
output = outputStream.ToArray();
}
return Convert.ToBase64String(output);
}

Replace static data in HTML page with dynamic one from database

I have an HTML code for my asp.net web application is that when i press the button Submit it calls a function to read what's in the bd and load it into an html page with certain template and I want to know how to replace static data in the original HTML file with dynamic ones from mysql database using C#. In my code I did read the input file and i did create an output file to write in as shown below:
StreamReader sr = new StreamReader("filepath/inputcv.html");
StreamWriter sw = new StreamWriter("filepath/outputcv.html");
This is a part of the code i need to replace the content of this paragraph by the one in my database
<div class="sectionContent">
<p>#1</p>
</div>
I saw this code and I wanted to do like it but I don't know how to write the query in it
StreamReader sr = new StreamReader("path/to/file.txt");
StreamWriter sw = new StreamWriter("path/to/outfile.txt");
string sLine = sr.ReadLine();
for (; sLine; sLine = sr.ReadLine() )
{
sLine = "{" + sLine.Replace(" ", ", ") + "}";
sw.Write(sLine);
}
It's possible to store the external data in either text file, or XML file, or database and use it for dynamic update of HTML page content. The first requirement to clarify: what will work as the "filter" for that "dynamic" data updates, in other words, which part of your code will perform dynamic updates the select statement (in case of mySQL)? If that statement remains the same, then it means that underlying data is changing, but what is controlling that changes in DB? And what is the desirable data refreshment rate of page updates? This should be clarified first before proceeding to actual code.
Hope this will help. My best, AB
Here is what I come up with based upon what you have provided. Please comment with more details if this is not enough.
string strHTMLPage = "";
string strNewHTMLPage = "";
int intStartIndex = 0;
int intEndIndex = 0;
string strNewDataToBeInserted = "Assumes you loaded this string with the
data you want inserted";
StreamReader sr = new StreamReader("filepath/inputcv.html");
strHTMLPage = sr.ReadToEnd();
sr.Close();
intStartIndex = strHTMLPage.IndexOf("<div class=\"sectionContent\">", 0) + 28;
intStartIndex = strHTMLPage.IndexOf("<p>", intStartIndex) + 3;
intEndIndex = strHTMLPage.IndexOf("</p>", intStartIndex);
strNewHTMLPage = strHTMLPage.Substring(0, intStartIndex);
strNewHTMLPage += strNewDataToBeInserted;
strNewHTMLPage += strHTMLPage.Substring(intEndIndex);
StreamWriter sw = new System.IO.StreamWriter("filepath/outputcv.html", false, Encoding.UTF8);
sw.Write(strNewHTMLPage);
sw.Close();

Itextsharp text extraction

I'm using itextsharp on vb.net to get the text content from a pdf file. The solution works fine for some files but not for other even quite simple ones. The problem is that the token stringvalue is set to null (a set of empty square boxes)
token = New iTextSharp.text.pdf.PRTokeniser(pageBytes)
While token.NextToken()
tknType = token.TokenType()
tknValue = token.StringValue
I can meassure the length of the content but I cannot get the actual string content.
I realized that this happens depending on the font of the pdf. If I create a pdf using either Acrobat or PdfCreator with Courier (that by the way is the default font in my visual studio editor) I can get all the text content. If the same pdf is built using a different font I got the empty square boxes.
Now the question is, How can I extract text regardless of the font setting?
Thanks
complementary for Mark's answer that helps me a lot .iTextSharp implementation namespaces and classes are a bit different from java version
public static string GetTextFromAllPages(String pdfPath)
{
PdfReader reader = new PdfReader(pdfPath);
StringWriter output = new StringWriter();
for (int i = 1; i <= reader.NumberOfPages; i++)
output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));
return output.ToString();
}
Check out PdfTextExtractor.
String pageText =
PdfTextExtractor.getTextFromPage(myReader, pageNum);
or
String pageText =
PdfTextExtractor.getTextFromPage(myReader, pageNum, new LocationTextExtractionStrategy());
Both require fairly recent versions of iText[Sharp]. Actually parsing the content stream yourself is just reinventing the wheel at this point. Spare yourself some pain and let iText do it for you.
PdfTextExtractor will handle all the different font/encoding issues for you... all the ones that can be handled anyway. If you can't copy/paste from Reader accurately, then there's not enough information present in the PDF to get character information from the content stream.
Here is a variant with iTextSharp.text.pdf.PdfName.ANNOTS and iTextSharp.text.pdf.PdfName.CONTENT if some one need it.
string strFile = #"C:\my\path\tothefile.pdf";
iTextSharp.text.pdf.PdfReader pdfRida = new iTextSharp.text.pdf.PdfReader(strFile);
iTextSharp.text.pdf.PRTokeniser prtTokeneiser;
int pageFrom = 1;
int pageTo = pdfRida.NumberOfPages;
iTextSharp.text.pdf.PRTokeniser.TokType tkntype ;
string tknValue;
for (int i = pageFrom; i <= pageTo; i++)
{
iTextSharp.text.pdf.PdfDictionary cpage = pdfRida.GetPageN(i);
iTextSharp.text.pdf.PdfArray cannots = cpage.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
if(cannots!=null)
foreach (iTextSharp.text.pdf.PdfObject oAnnot in cannots.ArrayList)
{
iTextSharp.text.pdf.PdfDictionary cAnnotationDictironary = (iTextSharp.text.pdf.PdfDictionary)pdfRida.GetPdfObject(((iTextSharp.text.pdf.PRIndirectReference)oAnnot).Number);
iTextSharp.text.pdf.PdfObject moreshit = cAnnotationDictironary.Get(iTextSharp.text.pdf.PdfName.CONTENTS);
if (moreshit != null && moreshit.GetType() == typeof(iTextSharp.text.pdf.PdfString))
{
string cStringVal = ((iTextSharp.text.pdf.PdfString)moreshit).ToString();
if (cStringVal.ToUpper().Contains("LOS 8"))
{ // DO SOMETHING FUN
}
}
}
}
pdfRida.Close();

Categories