Parse formatted text from RTF - c#

I'm trying to parse lines of bolded text from an RTF file. Right now, I'm sort of doing it by using Regex and looking for the "\b...\b0" tags in the file, but that leaves a lot of formatting text, and there are so many formatting tags in RTF that I can't just hard code it all out and call it a day. Is there a more elegant existing solution for parsing only lines with specific formatting?

I'd use an RTF parser... RichTextBox comes to mind. There are several ways of obtaining the formatting using the RTB.
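As a minimal sketch of the RichTextBox idea, the control can parse the RTF and each character's font can then be inspected. This assumes a Windows Forms project and an STA thread; the class and method names (RtfBoldExtractor, GetBoldRuns) are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;

// Hypothetical helper: collects contiguous runs of bold text from RTF markup.
static class RtfBoldExtractor
{
    public static List<string> GetBoldRuns(string rtf)
    {
        var runs = new List<string>();
        var current = new StringBuilder();

        using (var box = new RichTextBox())
        {
            box.Rtf = rtf;                 // let the control do the RTF parsing
            string text = box.Text;

            for (int i = 0; i < text.Length; i++)
            {
                box.Select(i, 1);          // inspect one character at a time
                var font = box.SelectionFont;
                bool bold = font != null && font.Bold;

                if (bold)
                {
                    current.Append(text[i]);
                }
                else if (current.Length > 0)
                {
                    // Bold run ended: store it and start over.
                    runs.Add(current.ToString());
                    current.Clear();
                }
            }
        }

        if (current.Length > 0) runs.Add(current.ToString());
        return runs;
    }
}
```

Selecting one character at a time is slow on large documents, but it avoids the case where SelectionFont is null because the selection mixes several fonts.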

No. I recently tackled a project in which we had to take an RTF document, complete with embedded media, and convert it to a MIME multipart message. We constructed several sets of RegEx to break apart the sections of the document and then converted each formatting option to an appropriate HTML/CSS tag. There really isn't an "elegant" way to do what you wish.
What are you trying to do with the RTF? Our end goal was to have an HTML conversion of the RTF supplied. I know that RichTextBox, within the WPF world, has the ability to save out to several formats, such as XAML, which may get rid of the need to handle the parsing yourself.
Also, there are RTF Converters out on the market, so with some more context I could suggest something better.

You should take a look at RtfDomParser.
I found some cases where the parser does not work, but overall it works well.

Related

Is there a high fidelity way to convert HTML into PDF and DOCX?

I need to convert HTML files into PDF and DOCX respectively (just the HTML -> PDF part would be good enough for now though).
I know there are some projects that help with what I want to achieve. I am currently using HTML-Renderer for the PDF part, and OpenXML for the DOCX.
I've tried HTML-Renderer, but the fidelity of the conversion is not great: I read somewhere that I can't make headers and footers with HTML for multipage formats. Furthermore, the conversion cuts off the end of the text when it passes from one page to another.
As for the DOCX, I don't know what the best options are.
I want, if possible, to know what good high-fidelity ways there are to convert HTML to those formats. Any help is greatly appreciated.
I'm open to ideas/advice on how to build it myself, but right now I don't have the time to do so, so I would much rather use an existing NuGet/DLL/library.

You could consider shelling out to pandoc:
https://pandoc.org
For visual appeal, you might like the Eisvogel template:
https://github.com/Wandmalfarbe/pandoc-latex-template
...which although designed for Markdown, ought to work for well structured, semantic HTML as input to Pandoc too.
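Shelling out to pandoc from C# might look like the sketch below. It assumes pandoc is installed and on the PATH; the class name PandocRunner and the file names are hypothetical. Pandoc infers the input and output formats from the file extensions.

```csharp
using System;
using System.Diagnostics;

// Hypothetical wrapper around the pandoc command line.
static class PandocRunner
{
    // Builds the command line: pandoc "<input>" -o "<output>"
    public static string BuildArguments(string inputPath, string outputPath)
    {
        return $"\"{inputPath}\" -o \"{outputPath}\"";
    }

    public static void Convert(string inputPath, string outputPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "pandoc",                 // assumption: pandoc is on PATH
            Arguments = BuildArguments(inputPath, outputPath),
            UseShellExecute = false,
            RedirectStandardError = true,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            // Read stderr before waiting so a full pipe cannot block pandoc.
            string errors = process.StandardError.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException("pandoc failed: " + errors);
        }
    }
}
```

A call such as PandocRunner.Convert("report.html", "report.docx") would produce the DOCX. PDF output additionally requires a PDF engine (LaTeX by default) to be installed.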

C# PDFsharp output RTF text

I have data in SQL that is in RTF as it contains a lot of superscript characters. I am trying to print the data on a PDF using PDFsharp (not MigraDoc) using DrawString, however, as I expected, it just shows the RTF string...
I tried putting it in a RichTextBox and then retrieving the Text property, this gives the correct plain text but not in superscript format, which I need.
Can anyone tell me how to correctly output the RTF data?

First, from the PDFsharp FAQ:
"Can I use PDFsharp to convert HTML or RTF to PDF?
No, not "out of the box", and we do not plan to write such a converter in the near future. Yes, PDFsharp with some extra code can do it. But we do not supply that extra code. On NuGet and other sources you can find a third party library "HTML Renderer for PDF using PdfSharp" that converts HTML to PDF. And there may be other libraries for the same or similar purposes, too. Maybe they work for you, maybe they get you started."
A workaround, I think, is to use DrawToBitmap with a RichTextBox to render the RTF string into an image, and then use DrawImage to put it in the PDF file.

Using a RichTextBox would also be my approach, but I would select only a single character in the text and query the relevant properties (subscript, superscript, maybe also bold, italic, underline, and anything else you need). And when any of those properties changes, draw the text you collected so far and continue collecting characters for the new set of properties until any relevant property changes.
I would probably use MigraDoc so I would not have to deal with line-breaks in my code, but that is up to you. I would not create bitmaps for the text as this voids the advantages of the PDF format.
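The character-by-character approach described above could be sketched as follows. It assumes Windows Forms; RtfRunSplitter and SplitRuns are hypothetical names. One loud caveat: SelectionCharOffset reports raised or lowered text (a baseline offset), and text marked with the RTF \super or \sub keywords may not show up through this property, so check it against your own data first.

```csharp
using System.Collections.Generic;
using System.Drawing;
using System.Text;
using System.Windows.Forms;

// Hypothetical helper: splits RTF into runs that share the same formatting.
static class RtfRunSplitter
{
    public static List<(string Text, FontStyle Style, int CharOffset)> SplitRuns(string rtf)
    {
        var runs = new List<(string Text, FontStyle Style, int CharOffset)>();
        var current = new StringBuilder();
        FontStyle style = FontStyle.Regular;
        int offset = 0;

        using (var box = new RichTextBox())
        {
            box.Rtf = rtf;
            string text = box.Text;

            for (int i = 0; i < text.Length; i++)
            {
                box.Select(i, 1);
                var font = box.SelectionFont;
                FontStyle s = font != null ? font.Style : FontStyle.Regular;
                int o = box.SelectionCharOffset;   // > 0 raised, < 0 lowered

                // A relevant property changed: emit the text collected so far.
                if (current.Length > 0 && (s != style || o != offset))
                {
                    runs.Add((current.ToString(), style, offset));
                    current.Clear();
                }

                style = s;
                offset = o;
                current.Append(text[i]);
            }
        }

        if (current.Length > 0) runs.Add((current.ToString(), style, offset));
        return runs;
    }
}
```

Each run can then be drawn with its own font and vertical shift, for example a smaller font drawn slightly higher for runs whose CharOffset is positive.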

C# PDFSharp: Examples of how to strip text from PDF?

I have a fairly simple task: I need to read a PDF file and write out its image contents while ignoring its text contents. So essentially I need to do the complement of "save as text".
Ideally, I would prefer to avoid any sort of re-compression of the image contents but if it's not possible, it's ok too.
Are there examples of how to do it?
Thanks!

Extracting text from a PDF file with PDFsharp is not a simple task.
It was discussed recently in this thread:
https://stackoverflow.com/a/9161732/162529

Extracting text from a PDF with PdfSharp can actually be very easy, depending on the document type and what you intend to do with it. If the text is in the document as text, and not an image, and you don't care about the position or format, then it's quite simple. This code gets all of the text of the first page in the PDFs I'm working with:
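// Requires: using PdfSharp.Pdf; using PdfSharp.Pdf.IO;
var doc = PdfReader.Open(docPath);
// Raw content stream of the first element on the first page, PDF operators included
string pageText = doc.Pages[0].Contents.Elements.GetDictionary(0).Stream.ToString();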
doc.Pages.Count gives you the total number of pages, and you access each one through the doc.Pages array with the index. I don't recommend using foreach and Linq here, as the interfaces aren't implemented well. The index passed into GetDictionary is for which PDF document element - this may vary based on how the documents are produced. If you don't get the text you're looking for, try looping through all of the elements.
The text that this produces will be full of various PDF formatting codes. If all you need to do is extract strings, though, you can find the ones you want using Regex or any other appropriate string searching code. If you need to do anything with the formatting or positioning, then good luck - from what I can tell, you'll need it.
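A rough sketch of the Regex step follows, assuming the page text is stored as PDF literal strings, that is, text in parentheses in front of operators such as Tj or TJ. It does not handle hex strings, nested parentheses, or fonts with custom encodings. The names ContentStreamText and ExtractLiteralStrings are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: pulls "(...)" literal strings out of a raw content stream.
static class ContentStreamText
{
    // A literal string: "(" ... ")" where a backslash escapes the next character.
    private static readonly Regex LiteralString =
        new Regex(@"\((?<s>(?:\\.|[^\\()])*)\)", RegexOptions.Singleline);

    public static List<string> ExtractLiteralStrings(string contentStream)
    {
        var result = new List<string>();
        foreach (Match match in LiteralString.Matches(contentStream))
        {
            // Undo only the simplest escapes: \( \) and \\
            string unescaped = Regex.Replace(match.Groups["s"].Value, @"\\([()\\])", "$1");
            result.Add(unescaped);
        }
        return result;
    }
}
```

Passing the pageText string from the snippet above to ExtractLiteralStrings returns the individual text fragments in the order they appear in the stream, which is not necessarily reading order.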

Example of PDFSharp libraries extracting images from .pdf file:
link
library
EDIT:
If you then want to extract text from the images, you have to use an OCR library.
There are two good OCR libraries: tessnet and MODI.
Link to thread on stack
I can fully recommend MODI, which I am using now. There is a sample on CodeProject.
EDIT 2:
If you don't want to read text from the extracted images, you should write a new PDF document and put all of the images into it. For writing PDFs I use MigraDoc. That library is not difficult to use.

How to create a word document using html written in C#

I am creating a C# application that has to create a Word document.
I'm using Microsoft.Office.Interop.Word to do this and I've successfully managed to output some Word documents, but creating the content through code is very time-consuming work.
I noticed that Word is able to open HTML pages and show them as normal content, so I created a simple test table in HTML and inserted it into the Word document. But when I output the document, the obvious happened: the tags were still there! Word did not interpret the tags as HTML. It just output exactly what I put in there.
How can I tell Word to reformat the text as HTML?
edit: (through the C# code of course)
edit 2: Please note that I'm parsing through some data to make this, so I will end up with about 4 pages of the same table/HTML, so I will need to be able to tell Word to start at the next page each time I've finished a loop. So an HTML-only method will probably not work.

If you're only wanting to output simple HTML content as a Word document, you could always cheat and write out the HTML content with a .doc extension.
Word will open that just fine.
If you need to add a page break, you can use a CSS page-break-before, like so:
<br style="page-break-before: always;"/>
If you're set on using Interop, having read up a little bit, this post states that you need a converter to insert HTML, and the converters are only accessible when you paste HTML from the Clipboard or open/insert HTML from a file.
So, this answer looks like it provides a clipboard-based solution: Adding html text to Word using Interop
However, if there's any money to spend on the project, I can heartily recommend Aspose.Words which will do all of this for you.

As requested by the OP, and to make it easier for others to find this solution, here is the answer I posted as a comment (plus extra results from testing):
When opening an HTML file, MS Word honors the CSS properties page-break-before and page-break-after. There is a caveat, however:
In "Web design" view, page breaks are never shown (this doesn't mean that they aren't there), just like browsers don't "show" them. And Word opens HTML files in Web design view by default (which makes sense). You need to print the document or switch to some other view (typically "Print design") to see your breaks in all their glory.
So, saving an HTML file with a .doc extension is a viable solution (also tested: Word opens it properly despite the extension).
Note: all the testing was done on MS Word 2003 using this snippet: <html>asdf<br style="page-break-before: always;">new page!</html>
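Generating such a file from C# could look like this minimal sketch, which joins one HTML fragment per page with the page-break element shown above. HtmlDocWriter, BuildHtml and Save are hypothetical names, and the fragments are assumed to be valid HTML already.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: writes HTML that Word opens as a document.
static class HtmlDocWriter
{
    private const string PageBreak = "<br style=\"page-break-before: always;\"/>";

    // Joins one HTML fragment per page, separated by CSS page breaks.
    public static string BuildHtml(IEnumerable<string> pageFragments)
    {
        var html = new StringBuilder("<html><body>");
        bool first = true;
        foreach (string fragment in pageFragments)
        {
            if (!first) html.Append(PageBreak);   // no break before the first page
            html.Append(fragment);
            first = false;
        }
        html.Append("</body></html>");
        return html.ToString();
    }

    // Saves the HTML under the given path, e.g. "report.doc".
    public static void Save(string path, IEnumerable<string> pageFragments)
    {
        File.WriteAllText(path, BuildHtml(pageFragments), Encoding.UTF8);
    }
}
```

Each pass of the data loop adds one table fragment to the list, and a single Save call at the end writes the whole document.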

Don't build the document in code. Create it in Word as a template or mail merge template, and then use code to merge or replace the field data.
See this answer here
MS Word Office Automation - Filling Text Form Fields And Check Box Form Fields And Mail Merge
And See this from the mothership:
http://msdn.microsoft.com/en-us/library/ff433638.aspx

If you don't want to use an external lib, Interop is too slow for you and neither pure HTML nor mail merge template are flexible enough, you could write your content as text or HTML into one or more files (using C#), create a VBA macro in a Word document which by itself creates a second Word document, reads the content files and does any formatting you want afterwards.
You can run this macro programmatically by starting Word using the command line switch /m.
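Starting Word with the /m switch from C# might look like the sketch below. The document path and macro name are placeholders, and it assumes Word is installed so that the shell can resolve winword.exe. WordMacroLauncher is a hypothetical name.

```csharp
using System.Diagnostics;

// Hypothetical helper: opens a document in Word and runs a named macro.
static class WordMacroLauncher
{
    // The macro name follows /m directly, without a space.
    public static string BuildArguments(string documentPath, string macroName)
    {
        return $"\"{documentPath}\" /m{macroName}";
    }

    public static Process Run(string documentPath, string macroName)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "winword.exe",   // assumption: resolvable by the shell
            Arguments = BuildArguments(documentPath, macroName),
            UseShellExecute = true
        };
        return Process.Start(startInfo);
    }
}
```

The returned Process can be used with WaitForExit if the macro closes Word when it is done.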

Another possible approach, if your html is xhtml (i.e. XML compliant), you could use XSLT to convert it to a Word XML format. But this would take a LOOOOOOOOOOONG time to code.
If you don't have to use HTML as the starting point you could simply build the Word XML document yourself rather than using XSLT, which would be easier. Time consuming but possible - it's something I do quite a lot in my work.

If a third party component is an option I would recommend the stuff from Aspose.
I have been pretty happy with their tools so far. The API is a little messy but everything works as one would expect.

How does one parse and convert AutoCAD MText entity to raw text?

I would like to parse AutoCAD's MText entity and extract the raw text. I see a pattern in the way the text is formatted. If this has already been solved, then I would not need to reinvent the wheel. I have searched online, but have not found sufficient information.
I am searching for any links or references on this subject.
Edit:
To further clarify, we are using the ODA (Open Design Alliance) libraries to access the DWG files. I am not familiar with this library. Another developer is using the library and extracting information from the files, including MText entities. I am then provided with a file containing the MText text, which is what I am looking at. I am looking at the MText formatted text, which I have access to and am working with in C#.
Questions:
I asked the other developer if the ODA library provided a means to extract the raw text unformatted. His response was that it could, however that it would also result in the entity getting written back to the DWG file. I am interested in the raw text without affecting the original DWG file. Does ODA provide a way of extracting the raw text without altering the file?
I am interested in any documentation on the formatting rules of MText, so that I can consider writing a parser myself if necessary.
Is there anything out there to convert MText to RTF? I realize that RTF would not completely satisfy all formatting rules, but this could provide a satisfactory means of displaying the formatted text in a WinForms app. Given RTF I could also obtain the raw text.

This Forum thread includes a VB program to strip the control characters from the MText. The code indicates what should be done to strip each control character, so it should be straightforward to write something similar in C#.
Additionally, the documentation of the format codes is available in the AutoCAD documentation.
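A small C# sketch of such a stripper is shown below. It is not a port of the VB program from that thread; it only covers the commonly documented format codes (\P, \~, escaped characters, the \L \O \K toggles, stacking with \S, and codes that take an argument ending in a semicolon). The name MTextStripper and the choice to render stacked text with a slash are assumptions.

```csharp
using System.Text;

// Hypothetical helper: removes MText format codes and returns plain text.
static class MTextStripper
{
    // Format codes that carry an argument terminated by ';' (font, height, color...).
    private const string CodesWithArgument = "fFHWTQACcp";

    public static string ToPlainText(string mtext)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < mtext.Length; i++)
        {
            char c = mtext[i];
            if (c == '{' || c == '}') continue;            // grouping braces
            if (c != '\\' || i + 1 >= mtext.Length) { sb.Append(c); continue; }

            char code = mtext[++i];
            switch (code)
            {
                case '\\': case '{': case '}':
                    sb.Append(code);                       // escaped literal
                    break;
                case 'P':
                    sb.Append('\n');                       // paragraph break
                    break;
                case '~':
                    sb.Append(' ');                        // non-breaking space
                    break;
                case 'L': case 'l': case 'O': case 'o': case 'K': case 'k':
                    break;                                 // underline/overline/strike toggles
                case 'S':
                    int end = mtext.IndexOf(';', i);
                    if (end < 0) end = mtext.Length;
                    string stacked = mtext.Substring(i + 1, end - i - 1);
                    sb.Append(stacked.Replace('^', '/').Replace('#', '/'));
                    i = end;
                    break;
                default:
                    if (CodesWithArgument.IndexOf(code) >= 0)
                    {
                        int semi = mtext.IndexOf(';', i);  // skip the argument
                        i = semi < 0 ? mtext.Length : semi;
                    }
                    else
                    {
                        sb.Append('\\').Append(code);      // unknown code: keep as-is
                    }
                    break;
            }
        }
        return sb.ToString();
    }
}
```

Unknown codes are left in the output on purpose, so anything the sketch does not understand stays visible instead of silently disappearing.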

If you are using C# and the .NET interface, the Text property of the MText object provides the raw text:
MText mt;
...
string rawText = mt.Text;
If you want the formatting as well, the solution is different.

If you are parsing an AutoCAD file without AutoCAD, you need to specify what file type you are parsing. However, this question is basically a subset of the following questions:
Are there any libraries for parsing AutoCAD files?
Open source cad drawing (dwg) library in C#
.Net CAD component that can read/write dxf/ dwg files
Reading .DXF files
For DWG, the basic options are Open Design Alliance and AutoCAD RealDWG.
If this doesn't help, please provide more details as to exactly what you are trying to do.

If you are using C#, give the netDXF library a try.
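The code would look something like this:
// Requires the netDxf library (using netDxf;)
DxfDocument dxf = DxfDocument.Load(openFileDialog1.FileName); // load your file
// This extracts the raw text of your first MText object
// (depending on the netDxf version, PlainText may be a method: PlainText())
string rawText = dxf.MTexts[0].PlainText;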
