Get plain text from an RTF text

My database has a column that holds text in RTF format.
How can I extract only the plain text from it, using C#?
Thanks :D

Microsoft provides an example where they basically stick the rtf text in a RichTextBox and then read the .Text property... it feels somewhat kludgy, but it works.
public static string ConvertToText(string rtf)
{
    using (RichTextBox rtb = new RichTextBox())
    {
        rtb.Rtf = rtf;
        return rtb.Text;
    }
}

For WPF you can use this extension method (it relies on the Xceed WPF Toolkit):
public static string RTFToPlainText(this string s)
{
    // For information: the default Xceed.Wpf.Toolkit.RichTextBox formatter is RtfFormatter
    Xceed.Wpf.Toolkit.RichTextBox rtBox = new Xceed.Wpf.Toolkit.RichTextBox(new System.Windows.Documents.FlowDocument());
    rtBox.Text = s;
    rtBox.TextFormatter = new Xceed.Wpf.Toolkit.PlainTextFormatter();
    return rtBox.Text;
}

If you want a pure code version, you can parse the rtf yourself and keep only the text bits. It's a bit of work, but not very difficult work - RTF files have a very simple syntax. Read about it in the RTF spec.
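A minimal sketch of that "parse it yourself" idea, assuming fairly simple RTF. The class name RtfStripper and the list of skipped destinations are illustrative choices, not part of any library. It does not handle \uN Unicode escapes, and it maps \'hh escapes straight to the character with that code, which is only an approximation of the document's code page.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class RtfStripper
{
    // Groups that carry metadata rather than document text (illustrative, not exhaustive).
    private static readonly HashSet<string> SkippedDestinations = new HashSet<string>
    {
        "fonttbl", "colortbl", "stylesheet", "info", "pict", "header", "footer"
    };

    private static bool IsAsciiLetter(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    public static string ToPlainText(string rtf)
    {
        var text = new StringBuilder();
        var skipStack = new Stack<bool>(); // "skip" state saved for each open group
        bool skip = false;
        int i = 0;

        while (i < rtf.Length)
        {
            char c = rtf[i++];
            if (c == '{')
            {
                skipStack.Push(skip);
            }
            else if (c == '}')
            {
                if (skipStack.Count > 0) skip = skipStack.Pop();
            }
            else if (c == '\\' && i < rtf.Length)
            {
                char next = rtf[i];
                if (next == '\\' || next == '{' || next == '}')
                {
                    // Escaped literal character.
                    if (!skip) text.Append(next);
                    i++;
                }
                else if (next == '*')
                {
                    // \* marks an optional destination: skip the rest of this group.
                    skip = true;
                    i++;
                }
                else if (next == '\'')
                {
                    // \'hh hex escape (approximation: treat hh as the character code).
                    int code;
                    if (i + 2 < rtf.Length &&
                        int.TryParse(rtf.Substring(i + 1, 2), NumberStyles.HexNumber,
                                     CultureInfo.InvariantCulture, out code))
                    {
                        if (!skip) text.Append((char)code);
                        i += 3;
                    }
                    else
                    {
                        i++;
                    }
                }
                else if (IsAsciiLetter(next))
                {
                    // Control word: letters, optional signed number, optional one space.
                    int start = i;
                    while (i < rtf.Length && IsAsciiLetter(rtf[i])) i++;
                    string word = rtf.Substring(start, i - start);
                    if (i + 1 < rtf.Length && rtf[i] == '-' && rtf[i + 1] >= '0' && rtf[i + 1] <= '9') i++;
                    while (i < rtf.Length && rtf[i] >= '0' && rtf[i] <= '9') i++;
                    if (i < rtf.Length && rtf[i] == ' ') i++;

                    if (SkippedDestinations.Contains(word)) skip = true;
                    else if (!skip && (word == "par" || word == "line")) text.Append('\n');
                    else if (!skip && word == "tab") text.Append('\t');
                }
                else
                {
                    i++; // other control symbols are ignored
                }
            }
            else if (c != '\r' && c != '\n' && !skip)
            {
                // Raw CR/LF in the RTF source are not part of the text.
                text.Append(c);
            }
        }
        return text.ToString();
    }
}
```

For example, RtfStripper.ToPlainText(@"{\rtf1\ansi{\fonttbl{\f0 Arial;}}\f0\fs24 Hello \b World\b0\par Bye}") should give "Hello World", a line break, then "Bye". For anything beyond simple documents the RichTextBox approach is probably the safer bet.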

Related

How to load and edit a .docx/.doc file in a RichTextBox or any other WinForms control with the correct document format?

I'm creating a WinForms desktop application in C# that loads a .docx file. I want to load the .docx file into a RichTextBox, but when I do, the formatting of the file is not preserved. Is there another control or method to load and save a .docx file with the correct document format?

Use this to read the file text and add the returned string to the RichTextBox.
private string GetWordFileText(string filepath)
{
    Microsoft.Office.Interop.Word.Application wordApp = null;
    Microsoft.Office.Interop.Word.Document doc = null;
    try
    {
        wordApp = new Microsoft.Office.Interop.Word.Application();
        doc = wordApp.Documents.Open(filepath, Visible: false);
        return doc.Content.Text;
    }
    finally
    {
        // Either object may still be null if Word failed to start or the file failed to open.
        if (doc != null)
            doc.Close();
        if (wordApp != null)
            wordApp.Quit();
    }
}
After this comes the document styling. I don't have a code snippet handy for this, but it would work roughly as follows:
var formatting = new Dictionary<string, Style>();
foreach (Paragraph para in doc.Paragraphs)
{
    // Keyed by paragraph text, so paragraphs with identical text share one entry.
    formatting[para.Range.Text] = (Style)para.get_Style();
}
Then, for the RichTextBox control, you need to work out a method that applies the styling:
foreach (var fItem in formatting)
{
    ApplyStyle(richTextBox, fItem.Key, fItem.Value);
}
void ApplyStyle(RichTextBox tb, string toFormat, Style style)
{
    tb.SelectionFont = new Font(style.Font.Name, style.Font.Size);
    tb.SelectedText = toFormat;
}

Finally, I found an answer. To load, edit, and display the document you can use the RichEditControl from DevExpress. It offers most of Word's features, and it can load and edit RTF, .doc, and .docx files.

Extract text from pdf by format

I am trying to extract the headlines from pdfs.
Until now I tried to read the plain text and take the first line (which didn't work because in plain text the headlines were not at the beginning) and just read the text from a region (which didn't work, because the regions are not always the same).
The easiest way to do this is in my opinion to read just text with a special format (font, fontsize etc.).
Is there a way to do this?

You can enumerate all text objects on a PDF page using the Docotic.Pdf library. For each text object, the font and the size of the object are available. Below is a sample:
public static void listTextObjects(string inputPdf)
{
    using (PdfDocument pdf = new PdfDocument(inputPdf))
    {
        string format = "{0}\n{1}, {2}px at {3}";
        foreach (PdfPage page in pdf.Pages)
        {
            foreach (PdfPageObject obj in page.GetObjects())
            {
                if (obj.Type != PdfPageObjectType.Text)
                    continue;
                PdfTextData text = (PdfTextData)obj;
                string message = string.Format(format, text.Text, text.Font.Name,
                    text.Size.Height, text.Position);
                Console.WriteLine(message);
            }
        }
    }
}
The code will output lines like the following for each text object on each page of the input PDF file.
FACTUUR
Helvetica-BoldOblique, 19.04px at { X=51.12; Y=45.54 }
You can use the retrieved information to find largest text or bold text or text with other properties used to format the headline.
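As a rough illustration of that selection step, here is a sketch that picks the tallest text item and prefers bold fonts on ties. TextItem and HeadlineFinder are hypothetical helper types, not part of Docotic.Pdf; you would fill TextItem from whatever your PDF library reports for each text object.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical holder for the values printed by the sample above.
public sealed class TextItem
{
    public string Text { get; set; }
    public string FontName { get; set; }
    public double Height { get; set; }
}

public static class HeadlineFinder
{
    // Returns the tallest non-empty item; a "Bold" font name breaks ties.
    // Returns null when there is no usable text at all.
    public static TextItem FindHeadline(IEnumerable<TextItem> items)
    {
        return items
            .Where(t => !string.IsNullOrWhiteSpace(t.Text))
            .OrderByDescending(t => t.Height)
            .ThenByDescending(t => t.FontName != null &&
                t.FontName.IndexOf("Bold", StringComparison.OrdinalIgnoreCase) >= 0)
            .FirstOrDefault();
    }
}
```

Whether "largest" or "bold" is the right signal depends on how your PDFs are laid out, so treat the ordering as a starting point.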
If your PDF is guaranteed to have the headline as the topmost text on a page, then you can use an even simpler approach:
public static void printText(string inputPdf)
{
    using (PdfDocument pdf = new PdfDocument(inputPdf))
    {
        foreach (PdfPage page in pdf.Pages)
        {
            string text = page.GetTextWithFormatting();
            Console.WriteLine(text);
        }
    }
}
The GetTextWithFormatting method returns text in reading order (i.e. from the top left to the bottom right).
Disclaimer: I am one of the developers of the library.

Itextsharp text extraction

I'm using iTextSharp with VB.NET to get the text content from a PDF file. The solution works fine for some files but not for others, even quite simple ones. The problem is that the token's StringValue comes back without readable content (it shows as a set of empty square boxes):
token = New iTextSharp.text.pdf.PRTokeniser(pageBytes)
While token.NextToken()
tknType = token.TokenType()
tknValue = token.StringValue
I can measure the length of the content, but I cannot get the actual string content.
I realized that this depends on the font of the PDF. If I create a PDF using either Acrobat or PdfCreator with Courier (which, by the way, is the default font in my Visual Studio editor), I can get all the text content. If the same PDF is built using a different font, I get the empty square boxes.
Now the question is: how can I extract text regardless of the font setting?
Thanks

This complements Mark's answer, which helped me a lot. The iTextSharp namespaces and classes are a bit different from the Java version:
public static string GetTextFromAllPages(string pdfPath)
{
    PdfReader reader = new PdfReader(pdfPath);
    try
    {
        StringWriter output = new StringWriter();
        for (int i = 1; i <= reader.NumberOfPages; i++)
            output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));
        return output.ToString();
    }
    finally
    {
        reader.Close(); // release the file even if extraction throws
    }
}

Check out PdfTextExtractor.
String pageText =
PdfTextExtractor.getTextFromPage(myReader, pageNum);
or
String pageText =
PdfTextExtractor.getTextFromPage(myReader, pageNum, new LocationTextExtractionStrategy());
Both require fairly recent versions of iText[Sharp]. Actually parsing the content stream yourself is just reinventing the wheel at this point. Spare yourself some pain and let iText do it for you.
PdfTextExtractor will handle all the different font/encoding issues for you... all the ones that can be handled anyway. If you can't copy/paste from Reader accurately, then there's not enough information present in the PDF to get character information from the content stream.

Here is a variant that uses iTextSharp.text.pdf.PdfName.ANNOTS and iTextSharp.text.pdf.PdfName.CONTENTS to read annotation text, in case someone needs it.
string strFile = @"C:\my\path\tothefile.pdf";
iTextSharp.text.pdf.PdfReader pdfReader = new iTextSharp.text.pdf.PdfReader(strFile);
int pageFrom = 1;
int pageTo = pdfReader.NumberOfPages;
for (int i = pageFrom; i <= pageTo; i++)
{
    iTextSharp.text.pdf.PdfDictionary page = pdfReader.GetPageN(i);
    iTextSharp.text.pdf.PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
    if (annots != null)
        foreach (iTextSharp.text.pdf.PdfObject oAnnot in annots.ArrayList)
        {
            // Resolve the indirect reference to the annotation dictionary.
            iTextSharp.text.pdf.PdfDictionary annotDictionary = (iTextSharp.text.pdf.PdfDictionary)pdfReader.GetPdfObject(((iTextSharp.text.pdf.PRIndirectReference)oAnnot).Number);
            iTextSharp.text.pdf.PdfObject contents = annotDictionary.Get(iTextSharp.text.pdf.PdfName.CONTENTS);
            if (contents != null && contents.GetType() == typeof(iTextSharp.text.pdf.PdfString))
            {
                string contentsText = ((iTextSharp.text.pdf.PdfString)contents).ToString();
                if (contentsText.ToUpper().Contains("LOS 8"))
                {
                    // Do something with the matching annotation here.
                }
            }
        }
}
pdfReader.Close();

Convert HTML or PDF to RTF/DOC or HTML/PDF to image using DevExpress or Infragistics

Is there a way to convert HTML or PDF to RTF/DOC, or HTML/PDF to an image, using DevExpress or Infragistics?
I tried this using DevExpress:
string html = new StreamReader(Server.MapPath(@".\teste.htm")).ReadToEnd();
RichEditControl richEditControl = new RichEditControl();
string rtf;
try
{
    richEditControl.HtmlText = html;
    rtf = richEditControl.RtfText;
}
finally
{
    richEditControl.Dispose();
}
StreamWriter sw = new StreamWriter(@"D:\teste.rtf");
sw.Write(rtf);
sw.Close();
But I have a complex html content (tables, backgrounds, css etc) and the final result is not good...

To convert Html content into image or Pdf you may use the following code:
using (RichEditControl richEditControl = new RichEditControl()) {
    richEditControl.LoadDocument(Server.MapPath(@".\teste.htm"), DocumentFormat.Html);
    using (PrintingSystem ps = new PrintingSystem()) {
        PrintableComponentLink pcl = new PrintableComponentLink(ps);
        pcl.Component = richEditControl;
        pcl.CreateDocument();
        //pcl.PrintingSystem.ExportToPdf("teste.pdf");
        pcl.PrintingSystem.ExportToImage("teste.jpg", System.Drawing.Imaging.ImageFormat.Jpeg);
    }
}

I suggest you use the latest DevExpress version (10.1.5 at the time of writing). It handles tables much better than previous ones.
Please use the following code to avoid encoding issues. The StreamReader and StreamWriter in your sample always use Encoding.UTF8, which will corrupt any content stored with another encoding:
using (RichEditControl richEditControl = new RichEditControl()) {
    richEditControl.LoadDocument(Server.MapPath(@".\teste.htm"), DocumentFormat.Html);
    richEditControl.SaveDocument(@"D:\teste.rtf", DocumentFormat.Rtf);
}
Also take a look at the richEditControl.Options.Import.Html and richEditControl.Options.Export.Rtf properties, you may find them useful for some cases.

Convert XML to Plain Text

My goal is to build an engine that takes the latest HL7 3.0 CDA documents and make them backward compatible with HL7 2.5 which is a radically different beast.
The CDA document is an XML file which when paired with its matching XSL file renders a HTML document fit for display to the end user.
In HL7 2.5 I need to get the rendered text, devoid of any markup, and fold it into a text stream (or similar) that I can write out in 80 character lines to populate the HL7 2.5 message.
So far, my approach is to use XslCompiledTransform to transform my XML document with XSLT and produce a resultant HTML document.
My next step is to take that document (or perhaps work at a step before this) and render the HTML as text. I have searched for a while, but can't figure out how to accomplish this. I'm hoping it's something easy that I'm just overlooking, or that I just can't find the magical search terms. Can anyone offer some help?
FWIW, I've read the 5 or 10 other questions in SO which embrace or admonish using RegEx for this, and don't think that I want to go down that road. I need the rendered text.
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;
using System.Xml.XPath;
public class TransformXML
{
    public static void Main(string[] args)
    {
        try
        {
            string sourceDoc = "C:\\CDA_Doc.xml";
            string resultDoc = "C:\\Result.html";
            string xsltDoc = "C:\\CDA.xsl";
            XPathDocument myXPathDocument = new XPathDocument(sourceDoc);
            XslCompiledTransform myXslTransform = new XslCompiledTransform();
            XmlTextWriter writer = new XmlTextWriter(resultDoc, null);
            myXslTransform.Load(xsltDoc);
            myXslTransform.Transform(myXPathDocument, null, writer);
            writer.Close();
            StreamReader stream = new StreamReader(resultDoc);
        }
        catch (Exception e)
        {
            Console.WriteLine("Exception: {0}", e.ToString());
        }
    }
}

Since you have the XML source, consider writing an XSL that will give you the output you want without the intermediate HTML step. It would be far more reliable than trying to transform the HTML.
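A minimal sketch of what such a direct XML-to-text transform could look like with XslCompiledTransform. The stylesheet, the class name XmlToText, and the un-namespaced p element are placeholders for illustration; a real CDA document uses the HL7 namespace and its own element names, so the match patterns would need to be adapted.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XmlToText
{
    // Placeholder stylesheet: emits each <p> as one line of plain text.
    private const string Stylesheet = @"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""text""/>
  <xsl:strip-space elements=""*""/>
  <xsl:template match=""p"">
    <xsl:value-of select=""normalize-space(.)""/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>";

    public static string Transform(string xml)
    {
        var xslt = new XslCompiledTransform();
        using (var xsltReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(xsltReader);
        }

        using (var xmlReader = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            // With method="text" the transform writes plain text, no markup.
            xslt.Transform(xmlReader, null, output);
            return output.ToString();
        }
    }
}
```

Calling XmlToText.Transform("<doc><p>first  paragraph</p><p>second</p></doc>") should yield the two paragraphs on separate lines with whitespace normalized. The point of the design is that the HTML step disappears entirely, so there is nothing to strip afterwards.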

This will leave you with just the text:
using System;
using System.Text;
class Program
{
    static void Main(string[] args)
    {
        var blah = new System.IO.StringReader(sourceDoc);
        var reader = System.Xml.XmlReader.Create(blah);
        StringBuilder result = new StringBuilder();
        // Element nodes have an empty Value, so only the text nodes contribute.
        while (reader.Read())
        {
            result.Append(reader.Value);
        }
        Console.WriteLine(result);
    }
    static string sourceDoc = "<html><body><p>this is a paragraph</p><p>another paragraph</p></body></html>";
}

Or you can use a regular expression:
public static string StripHtml(String htmlText)
{
    // replace all tags with spaces...
    htmlText = Regex.Replace(htmlText, @"<(.|\n)*?>", " ");
    // .. then eliminate all double spaces
    while (htmlText.Contains("  "))
    {
        htmlText = htmlText.Replace("  ", " ");
    }
    // clear out non-breaking spaces and & character code
    htmlText = htmlText.Replace("&nbsp;", " ");
    htmlText = htmlText.Replace("&amp;", "&");
    return htmlText;
}

Can you use something like this which uses lynx and perl to render the html and then convert that to plain text?

This is a great use-case for XSL:FO and FOP. FOP isn't just for PDF output, one of the other major outputs that is supported is text. You should be able to construct a simple xslt + fo stylesheet that has the specifications (i.e. line width) that you want.
This solution is a bit more heavyweight than just using xml->xslt->text as ScottSEA suggested, but if you have any more complex formatting requirements (e.g. indenting), they will be much easier to express in FO than to mock up in XSLT.
I would avoid regexes for extracting the text. That's too low-level and guaranteed to be brittle. If you just want text and 80-character lines, the default XSLT template will only print element text. Once you have only the text, you can apply whatever text processing is necessary.
Incidentally, I work for a company that produces CDAs as part of our product (voice recognition for dictations). I would look into an XSLT that transforms the 3.0 directly into 2.5. Depending on the fidelity you want to keep between the two versions, the full XSLT route will probably be your easiest bet if what you really want to achieve is conversion between the formats. That's what XSLT was built to do.
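For the "80-character lines" part, once you have plain text, a simple greedy word wrap may be enough. This is only a sketch: LineWrapper is a made-up helper name, it collapses all whitespace (including existing line breaks), and a single word longer than the width is left on its own line rather than split.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LineWrapper
{
    // Greedy wrap: fill each line with as many whole words as fit in "width".
    public static List<string> Wrap(string text, int width = 80)
    {
        var lines = new List<string>();
        var current = new StringBuilder();

        // A null separator array splits on any whitespace.
        foreach (string word in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            // Start a new line if adding " word" would exceed the width.
            if (current.Length > 0 && current.Length + 1 + word.Length > width)
            {
                lines.Add(current.ToString());
                current.Clear();
            }
            if (current.Length > 0) current.Append(' ');
            current.Append(word);
        }

        if (current.Length > 0) lines.Add(current.ToString());
        return lines;
    }
}
```

For instance, LineWrapper.Wrap("aaa bbb ccc", 7) gives the two lines "aaa bbb" and "ccc". If the HL7 2.5 segments need paragraph breaks preserved, you would wrap each paragraph separately instead of the whole text at once.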
