How does one parse and convert AutoCAD MText entity to raw text? - c#

I would like to parse AutoCAD's MText entity and extract the raw text. I see a pattern in the way the text is formatted. If this has already been solved, then I would not need to reinvent the wheel. I have searched online, but have not found sufficient information.
I am searching for any links or references on this subject.
Edit:
To further clarify, we are using the ODA (Open Design Alliance) libraries to access the DWG files. I am not familiar with this library. Another developer uses it to extract information from the files, including MText entities. I am then given a file containing the MText formatted text, which is what I am working with in C#.
Questions:
I asked the other developer if the ODA library provides a means to extract the raw, unformatted text. His response was that it does, but that doing so would also result in the entity being written back to the DWG file. I am interested in the raw text without affecting the original DWG file. Does ODA provide a way of extracting the raw text without altering the file?
I am interested in any documentation on the formatting rules of MText, so that I can consider writing a parser myself if necessary.
Is there anything out there to convert MText to RTF? I realize that RTF would not completely satisfy all formatting rules, but this could provide a satisfactory means of displaying the formatted text in a WinForms app. Given RTF I could also obtain the raw text.

This Forum thread includes a VB program to strip the control characters from the MText. The code indicates what should be done to strip each control character, so it should be straightforward to write something similar in C#.
Additionally, the documentation of the format codes is available in the AutoCAD documentation.
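As a rough illustration of what such a stripper could look like in C#, here is a minimal sketch. It is not a port of the VB program from the thread; the class name MTextStripper and the set of codes handled are assumptions based on the commonly documented MText format codes (\P paragraph break, \~ non-breaking space, \L \l \O \o \K \k toggles, \f \F \H \C \c \T \Q \W \A \p with an argument ending in ';', \S stacked text, and { } grouping). Codes it does not recognize are left in the output unchanged.

```csharp
using System;
using System.Text;

public static class MTextStripper
{
    // Codes that carry an argument terminated by ';' (font, height, color, ...).
    // Assumption: this list covers the commonly documented codes only.
    private const string ArgCodes = "fFHCcTQWAp";
    // Codes that switch underline/overline/strike-through on or off; no argument.
    private const string ToggleCodes = "LlOoKk";

    public static string ToPlainText(string mtext)
    {
        var sb = new StringBuilder();
        int i = 0;
        while (i < mtext.Length)
        {
            char c = mtext[i];
            if (c == '\\' && i + 1 < mtext.Length)
            {
                char code = mtext[i + 1];
                if (code == '\\' || code == '{' || code == '}')
                {
                    sb.Append(code);            // escaped literal character
                    i += 2;
                }
                else if (code == 'P') { sb.Append('\n'); i += 2; }  // paragraph break
                else if (code == '~') { sb.Append(' '); i += 2; }   // non-breaking space
                else if (code == 'S')
                {
                    // Stacked text such as \S1/2; or \S1^2; is flattened to "1/2".
                    int end = mtext.IndexOf(';', i + 2);
                    if (end < 0) end = mtext.Length;
                    string stacked = mtext.Substring(i + 2, end - (i + 2));
                    sb.Append(stacked.Replace('^', '/').Replace('#', '/'));
                    i = Math.Min(end + 1, mtext.Length);
                }
                else if (ArgCodes.IndexOf(code) >= 0)
                {
                    // Skip the code and its argument up to and including ';'.
                    int end = mtext.IndexOf(';', i + 2);
                    i = end < 0 ? mtext.Length : end + 1;
                }
                else if (ToggleCodes.IndexOf(code) >= 0) { i += 2; }
                else { sb.Append(c); i++; }     // unknown code: keep it as-is
            }
            else if (c == '{' || c == '}') { i++; }  // grouping braces carry no text
            else { sb.Append(c); i++; }
        }
        return sb.ToString();
    }
}
```

For example, MTextStripper.ToPlainText(@"{\fArial|b0|i0|c0|p34;Hello}\P\LWorld\l") yields "Hello", a line break, then "World". Because it works on the string only, it never touches the DWG file.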

If you are using C# and the .NET interface, the Text property of the MText object provides the raw text:
MText mt;
...
string rawText = mt.Text;
If you want the formatting as well, the solution is different.

If you are parsing an AutoCAD file without AutoCAD, you need to specify what file type you are parsing. However, this question is basically a subset of the following questions:
Are there any libraries for parsing AutoCAD files?
Open source cad drawing (dwg) library in C#
.Net CAD component that can read/write dxf/ dwg files
Reading .DXF files
For DWG, the basic options are Open Design Alliance and AutoCAD RealDWG.
If this doesn't help, please provide more details as to exactly what you are trying to do.

If you are using C#, give the netDXF library a try.
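I think the pseudo code should look something like this (exact member names may differ between netDXF versions):
// Load is a static method, so there is no need to create an empty document first
DxfDocument dxf = DxfDocument.Load(openFileDialog1.FileName);
// This extracts the raw text of the first MText object
string rawText = dxf.MTexts[0].PlainText();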

Related

open or convert webarchive File in c#

I am trying to find a way to open or convert a webarchive file to any other format in C#. The goal is an automated import system with as few restrictions on file type as possible. I cannot seem to find any way of converting the file other than using Safari to open it.
Unfortunately, what you are looking for cannot really be done. A webarchive is a proprietary file type made by Apple to display offline webpages in Safari. It is a combination of XML, HTML, and binary data, but there are examples in Objective-C that convert a webarchive to a zip archive containing the HTML and the embedded images/media of the page that was saved into the webarchive file.
Here is an Objective-C example from GitHub - WebArchiveExtractor
As for converting to PDF...not sure that can be done, you would be better off printing the webpage to PDF in the first place and then uploading that to your document management system.
Apparently, though, the webarchive file type contains XML with binary-encoded images/media, similar to an MHTML file, so you may be able to figure out the format by viewing the files in a text editor and then writing a conversion utility. There is very limited information on the web regarding the internal schema of the webarchive file format, so this may be a daunting task. However, since WebKit is open source, you can look at their code for creating an archive and try to reverse it to build your converter. Here's the source code (in C++) for the archiving features in Safari, which actually looks like it is using MHTML, but I haven't explored deeply enough to tell if it's exactly the same format: http://trac.webkit.org/browser/trunk/Source/WebCore/loader/archive
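If you go the route of writing your own conversion utility, a first step could look like the sketch below. It rests on assumptions that are not confirmed in this thread: that the webarchive is a property list, that it has already been converted to XML plist form (for example with plutil on a Mac), that it uses the keys WebMainResource and WebResourceData, and that the main resource is UTF-8 text. The class name WebArchiveXml is hypothetical.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class WebArchiveXml
{
    // Returns the main HTML of a webarchive that is already in XML plist form.
    // Assumption: the main resource is UTF-8; a real file names its encoding
    // in a separate key, which this sketch ignores.
    public static string ExtractMainHtml(string plistXml)
    {
        XDocument doc = XDocument.Parse(plistXml);
        XElement root = doc.Root.Element("dict");
        XElement main = ValueFor(root, "WebMainResource");
        XElement data = ValueFor(main, "WebResourceData");
        // <data> elements hold base64; FromBase64String ignores whitespace.
        byte[] bytes = Convert.FromBase64String(data.Value);
        return Encoding.UTF8.GetString(bytes);
    }

    // In a plist <dict>, each <key> is immediately followed by its value element.
    private static XElement ValueFor(XElement dict, string key)
    {
        XElement k = dict.Elements("key").First(e => e.Value == key);
        return k.ElementsAfterSelf().First();
    }
}
```

The images and other media would sit in further entries of the same structure, so the same lookup helper could be reused for them once you have confirmed the key names against a real file.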
Good Luck!

C# PDFSharp: Examples of how to strip text from PDF?

I have a fairly simple task: I need to read a PDF file and write out its image contents while ignoring its text contents. So essentially I need to do the complement of "save as text".
Ideally, I would prefer to avoid any sort of re-compression of the image contents but if it's not possible, it's ok too.
Are there examples of how to do it?
Thanks!
Extracting text from a PDF file with PDFsharp is not a simple task.
It was discussed recently in this thread:
https://stackoverflow.com/a/9161732/162529
Extracting text from a PDF with PdfSharp can actually be very easy, depending on the document type and what you intend to do with it. If the text is in the document as text, and not an image, and you don't care about the position or format, then it's quite simple. This code gets all of the text of the first page in the PDFs I'm working with:
var doc = PdfReader.Open(docPath);
string pageText = doc.Pages[0].Contents.Elements.GetDictionary(0).Stream.ToString();
doc.Pages.Count gives you the total number of pages, and you access each one through the doc.Pages array with the index. I don't recommend using foreach and Linq here, as the interfaces aren't implemented well. The index passed into GetDictionary is for which PDF document element - this may vary based on how the documents are produced. If you don't get the text you're looking for, try looping through all of the elements.
The text that this produces will be full of various PDF formatting codes. If all you need to do is extract strings, though, you can find the ones you want using Regex or any other appropriate string searching code. If you need to do anything with the formatting or positioning, then good luck - from what I can tell, you'll need it.
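To illustrate the Regex step, here is a minimal sketch that pulls the literal strings (the parts in parentheses) out of a content stream that has already been obtained as text, for example via the ToString() call above. The class name PdfStreamText is hypothetical, and the sketch assumes an uncompressed stream with simple literal strings: it does not handle hex strings, nested unescaped parentheses, or fonts with custom encodings.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PdfStreamText
{
    // Matches "( ... )" where the body is any mix of escaped characters
    // (backslash + one char) and characters other than backslash or parentheses.
    private static readonly Regex Literal =
        new Regex(@"\(((?:\\.|[^\\()])*)\)", RegexOptions.Singleline);

    public static List<string> ExtractLiteralStrings(string contentStream)
    {
        var result = new List<string>();
        foreach (Match m in Literal.Matches(contentStream))
        {
            // Undo the escaping of parentheses and backslashes inside the string.
            string s = Regex.Replace(m.Groups[1].Value, @"\\([()\\])", "$1");
            result.Add(s);
        }
        return result;
    }
}
```

Given the stream text "BT /F1 12 Tf (Hello) Tj [(Wor) -20 (ld)] TJ ET", this returns the three fragments "Hello", "Wor" and "ld"; deciding which fragments belong to the same word or line is left to the caller.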
Example of PDFSharp libraries extracting images from .pdf file:
link
library
EDIT:
Then, if you want to extract text from the images, you have to use an OCR library.
There are two good OCR engines: tessnet and MODI.
Link to thread on Stack Overflow
I can fully recommend MODI, which I am using now. There is a sample on CodeProject.
EDIT 2 :
If you don't want to read text from extracted images, you should write new PDF document and put all of them into it. For writing PDFs I use MigraDoc. It is not difficult to use that library.

Converting pdf to text

I need to create a C# or C++ (MFC) application that converts PDF files to txt. I need not only to convert, but also to remove headers, footers, some garbage characters on the left margin, etc. Thus the application should allow the user to set page margins to cut off what is not needed. I have actually already created such an application using xpdf, but it gives me some problems when I try to insert custom tags into the extracted text to preserve italics and bold. Maybe somebody could suggest something useful?
Thanks.
There are shareware and freeware utilities out there. Try fetching their source code, or perhaps use them the way they are.
A public version of the PDF specification can be found here: Adobe PDF Specification
PDF shareware readers can be found here: PDF Reader source code on SourceForge
Please look at PoDoFo. It's an LGPL-licensed library that has many powerful editing features. One of its examples, txt2pdf IIRC, is a good start: it shows basic text extraction. From there you can check whether pre-filtering (in the PDF engine) or post-filtering (in the text) suffices for your goals. I didn't get to use PDFHummus, but it's supposed to have these capabilities too, although it's less straightforward.

Programmatically find and replace text in pdf

I'm trying to create a web script that will allow me to alter PDF templates that I have uploaded and re-output them. I have already tried Zend, which allows me to write to a PDF, but that means leaving the PDF blank in certain spaces, which is too primitive for what I need. PDFFlip was not any better.
We need to implement functionality so we can remove content from the PDF as well as remove and replace. I have looked at CAM::PDF and changepagestring.pl but I'm not sure it's up to the job. I was hard pressed to find any real usage examples and Perl is not a language I have used before.
This is for a web project but I am flexible with the language we use, ideally PHP or ASP.NET C# would be great. Preferably not Java unless there is no other way.
I should also point out that I looked through the FoxitReader SDK without any luck. I never tried to implement it but I found no mention of find and replace like functionality.
You can tinker with PDF text, but a simple search and replace is not straightforward. PDF is designed as an end-file format, not for easy editing. I wrote a blog post explaining some of the issues at http://pdf.jpedal.org/java-pdf-blog/bid/12670/PDF-text
As a workaround, it may be better to keep and fill in the templates in a format that is more convenient for editing. E.g., you can keep your templates as Microsoft Word templates and then export them to PDF after filling them in. This thread may be useful for that approach.
PDF file format isn't quite appropriate for editing.
Alternatively, you may prepare your templates as PDFs containing form fields. In that case, filling in form fields is a common and well-known task, and there are a lot of PDF components for it.

What's the best way to save a RichTextFile in C#?

I'm trying to create a Notepad/WordPad clone. I want to save the file in .rtf format so that it can be read by WordPad. How do I do this in C#?
Assuming you are trying to do this yourself for learning purposes, you don't want to use a library to do it for you:
Basically you need to write a program which holds in memory either a string with RTF markup, or a DOM-like data tree. Using the RTF specification, you should write a flat (text) file marked up properly and with a .rtf extension. Just the same as if you were writing an HTML file.
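As a minimal sketch of the "string with RTF markup" approach, the following turns plain text into a small RTF document. The class name MinimalRtf and the choice of Arial at 12 pt are illustrative assumptions; it only escapes the characters RTF treats specially, maps line breaks to \par, and writes non-ASCII characters as \uN? escapes, so it is a starting point rather than a full RTF writer.

```csharp
using System.Text;

public static class MinimalRtf
{
    public static string FromPlainText(string text)
    {
        var sb = new StringBuilder();
        // Header: RTF version 1, ANSI charset, one font (f0), 12 pt (fs24 = half-points).
        sb.Append(@"{\rtf1\ansi\deff0{\fonttbl{\f0 Arial;}}\f0\fs24 ");
        foreach (char ch in text)
        {
            if (ch == '\\' || ch == '{' || ch == '}')
                sb.Append('\\').Append(ch);          // escape RTF control characters
            else if (ch == '\n')
                sb.Append(@"\par ");                 // paragraph break
            else if (ch == '\r')
                continue;                            // drop CR; LF already handled
            else if (ch > 127)
                // \uN takes a signed 16-bit value; '?' is the fallback character.
                sb.Append(@"\u").Append(unchecked((short)ch)).Append('?');
            else
                sb.Append(ch);
        }
        sb.Append('}');
        return sb.ToString();
    }
}
```

The result can then be written out as a flat text file, e.g. System.IO.File.WriteAllText("note.rtf", MinimalRtf.FromPlainText(myText), Encoding.ASCII), where "note.rtf" and myText are placeholders.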
Correct me if I'm wrong, but if you're using the RichTextBox control, you can just use the RichTextBox.SaveFile method to accomplish this. My guess, though, is that you mean doing it without using that control.
RTF SpecLink
Create the XML spec based on their API, and you can make your app compatible with WordPad, Word, etc.
