I come from a world (iOS and OSX) where saving a rich text string (NSAttributedString) to a file is easy (NSAttributedString can serialise itself as RTF or HTML etc.)
I'm now porting my app to Windows/C# and trying to figure out how to load/save RTF (so that my apps can share a file format)
It looks like I need to create a control (RichTextBox) in order to import RTF - which is OK for testing, but I need to read in thousands of rich text snippets so this is (probably) going to be a bit of a bottleneck.
How is this type of cross platform rich text format best achieved? I'm starting to think I will need to copy the .docx approach and create XML with textRun elements etc? This strikes me as a problem that must have been solved already - but google is not forthcoming!!
Thanks
Related
I have PDF document data with table structure format and I would like to convert that PDF file into a text file with the same structure with margin and spaces between text in pdf
You need to write your own PDF tool then. Which is not exactly an easy task. Honestly, 3rd party tools make your job much easier, why don't you want to use one?
If you change your mind, I can suggest iTextSharp. I've used it in the past with great success. Here are some example to get you going:
http://www.codeproject.com/Articles/12445/Converting-PDF-to-Text-in-C
ps. there are 3 tools used in there.
I need to create a C# or C++ (MFC) application that converts pdf files to txt. I need not only to convert, but remove headers, footers, some garbage characters on the left margin etc. Thus the application shold allow the user to set page margins to cut off what is not needed. I actually have already created such an application using xpdf, but it gives me some problems when I am trying to insert custom tags into the extracted text to preserve italics and bold. Maybe somebody could suggest something useful?
Thanks.
There are shareware and freeware utilities out there. Try fetching their source code, or perhaps use them the way they are.
A public version of the PDF specification can be found here: Adobe PDF Specification
PDF Shareware readers can be found: PDF Reader source code # SourceForge
Please look at Podofo. It's a LGPL-licensed library that has many powerful editing features. One of it's examples, txt2pdf IIRC, is a good start: it shows basic text-extraction; From there you can check if pre (in pdf engine) or post (in text) filtering suffices to your goals. I didn't get to use Pdf Hummus, but it's supposed to have these capabilities too, although it's less straightforward.
I would like to parse AutoCAD's MText entity and extract the raw text. I see a pattern in the way the text is formatted. If this has already been solved, then I would not need to reinvent the wheel. I have searched online, but have not found sufficient information.
I am searching for any links or references on this subject.
Edit:
To further clarify, we are using the ODA (Open Design Aliance) libraries to access the DWG files. I am not familiar with this library. Another developer is using the library and extracting information from the files including MText entities. I am then provided with a file containing the MText text, which is what I am looking at. I am looking at the MText formatted text, which I have access to and am working with in C#.
Questions:
I asked the other developer if the ODA library provided a means to extract the raw text unformatted. His response was that it could, however that it would also result in the entity getting written back to the DWG file. I am interested in the raw text without affecting the original DWG file. Does ODA provide a way of extracting the raw text without altering the file?
I am interested in any documentation on the formatting rules of MText, so that I can consider writing a parser myself if necessary.
Is there anything out there to convert MText to RTF? I realize that RTF would not completely satisfy all formatting rules, but this could provide a satisfactory means of displaying the formatted text in a WinForms app. Given RTF I could also obtain the raw text.
This Forum thread includes a VB program to strip the control characters from the MText. The code indicates what should be done to strip each control character, so it should be straightforward to write something similar in C#.
Additionally, the documentation of the format codes is available in the AutoCAD documentation.
If you are using C# and the .NET interface, the Text property of the MText object provides the raw text:
MText mt;
...
string rawText = mt.Text;
If you want the formatting as well, the solution is different.
If you are parsing an AutoCAD file without AutoCAD, you need to specify what file type you are parsing. However, this question is basically a subset of the following questions:
Are there any libraries for parsing AutoCAD files?
Open source cad drawing (dwg) library in C#
.Net CAD component that can read/write dxf/ dwg files
Reading .DXF files
For DWG, the basic options are Open Design Alliance and AutoCAD RealDWG.
If this doesn't help, please provide more details as to exactly what you are trying to do.
If you are using C#, give the netDXF library a try.
I thought pseudo code should be like this:
DxfDocument dxf = new DxfDocument();
dxf = DxfDocument.Load(openFileDialog1.FileName);//load your file
//This extracts the raw text of your first text obj
dxf.MTexts[0].PlainText;
I'm trying to create a web script that will allow me to alter PDF templates that I have uploaded and re-output them. I have tried Zend already which allows me to write to a PDF but that means leaving the PDF blank in certain space which is to primitive for what I need. PDFFlip was not any better.
We need to implement functionality so we can remove content from the PDF as well as remove and replace. I have looked at CAM::PDF and changepagestring.pl but I'm not sure it's up to the job. I was hard pressed to find any real usage examples and Perl is not a language I have used before.
This is for a web project but I am flexible with the language we use, ideally PHP or ASP.NET C# would be great. Preferably not Java unless there is no other way.
I should also point out that I looked through the FoxitReader SDK without any luck. I never tried to implement it but I found no mention of find and replace like functionality.
You can tinker with PDF text but it is not straight-forward just to search and replace. The text is designed as an end-file format not for easy editing. I wrote a blog post explaining some of the issues at http://pdf.jpedal.org/java-pdf-blog/bid/12670/PDF-text
May be as workaround it's better to hold and fill in templates in some format that is more convenient for editing? E.g., you can keep your templates as Microsoft Word templates and then export them to PDF after filling. This thread may be useful on this way.
PDF file format isn't quite appropriate for editing.
Alternatively, you may prepare your templates as PDFs containing form fields. In this case filling of form fields is common and well-known task and there a lot of pdf components for this.
I want to parse a PDF file from my C# app and create an audio file off it.
How would I do that ?
I'm particularly looking for a good pdf to text library or a way to strip a pdf file off its text.
You preferably have a tagged PDF document as your input document. This means that the document contains tags to mark up the logical structure of the document (typically a PDF document will only contain visual information).
This PDF could then be converted into DAISY format, which is a standard for digital talking books, i.e. an intermediate XML format storing the text of books along with the logical structure and navigation features.
This Daisy XML format can be either converted to an audio format, or you could be using a Daisy reader, a physical device like an MP3 player to listen to the book.
There is a presentation available at the Daisy web site explaining the principles of this toolchain:
Accessible PDF to DAISY/NIMAS Conversion
Use Festival for the text to speech. Various pdf to text api's exist...
You need the Speech SDK from Microsoft. Read an instruction here
As the other posters outlined, first you have to extract the text from the .pdf file. pdf files are an open format now, so you can probably find a parser through Google.
Then you have to extract the text you want to convert to speech from the file, ignoring things like figure titles, page headers, table of contents etc.
Once you've got the text, you need to convert it to speech. This is probably the hardest part.
A while ago I was fiddling around with generating voice files for a gaming mod, since I'm a rotten voice actor.
Cepstral had the best TTS converters I could find. (The free ones had an annoying tendency to insert Cepstral advertisements in the speech, but I could manually edit this out for what I was doing.)
It turns out that there's a speech synthesis markup language which can be used to provide clues to the TTS converter about which syllable to place accents, etc. Here's a linky:
http://www.w3.org/TR/speech-synthesis/
How you go about automatically adding the SSML to the text is a bit beyond me.
Anyway, the TTS converter will produce an audio file, and the final step would be to compress the audio at the desired bit rate in mp3 format.
If your sole task is to listen to speech synthesized text from a PDF, how about the Acrobat "Read out loud" function at the bottom of the "View" menu?
I guess it's a hard thing to do. Firstly you need to read the text in that pdf, and then use some mechanism of synthetic voice generation to create the audio content. Then you have to store it as an mp3.
On Mac OS X, you can extract the text of the pdf and then pipe it in "say". You should find equivalent synthetisers on other OS.
It's not all that complicated to do, provided that you don't re-invent the wheel, but instead simply reuse existing technology (i.e. text to speech engines like festival), as well as OCR engines to process the PDF files.
The most complicated thing probably is to work with different PDF layouts (columns, rows, embedded graphics,foot notes, URLs etc), which may obfuscate the text recognition process.
However, in general (if this is not supposed to be a learning experience), it is certainly easier to just resort to using existing software solutions:
Visual Text to Speech
Text to speech tools
zero 2000