How to extract text from Word files using C#?

How to extract text from Word files using C#? - c#

I am trying to convert a large number (100,000) of word DOC files, these are quite old. From around 1995 to 2000 version of Word, i supposed. I keep going around in circles from what i see here in stack overflow and the MS documentation.
What i want do so is simply read the file, stick the text into a string, parse the string, take out the structure stuff (the file is actually a structured report, looks like Patient: Jon Doe). At that point, I know what i am doing. I can parse the string data, stick it into useful variables, then stick this data into a database. But I do not know how to actually put the text into a string. Any help?
PPS i found this reference which supposedly puts a DOC file into a text file. It's a start, but i'd rather avoid doing a bunch of file manipulations.

If you try to use the Word object model, you must always instantiate a certain version of Word on the client (since running Word on a server is not recommended). Unfortunately, you'll depend of the restriction of Word concerning older files, e.g. in Word 2010 you can open files from Office 95 only in sandbox mode (i.e you're not able to access the file content programmatically). Additionally, you'll have to deal with unknown template content (documents with macros attached, for example).
In your case I'd rather look for a 3p-component which allows to access the content.
I know from document management systems like OpenText eDocs and Autonomy iManage that they use other tools to full-index documents of all types and can present the content in a viewer application. So if you look in this direction, may be you find something useful.

A word file is just a normal file as far as your code goes.
Try this:
using System.IO;
StreamReader streamReader = new StreamReader(filePath);
string text = streamReader.ReadToEnd();
streamReader.Close();

Related

How to parse an ncx file using c#

I am trying to create a windows phone app for reading e-pubs. I extracted the content and now I want to read the ncx file. But when I try to use System.Xml.Serialization.XmlSerializer it is telling me unknown field in the second line itself. Please help

Here is how the basic approach to read an epub file
Treat the EPUB file as a ZIP archive and read it using the Windows
built-in ZIP archive reader, ZipArchive
In the archive, find the file META-INF/container.xml and look in it
to find the full-path attribute of the root-file element. That gives
you the path to the OPF file (probably something like
OPS/content.opf) The 'manifest' element of the OPF file will tell
you the names of all the files that make up the book. The 'spine'
element will tell you the order in which they appear in the book (and
will include a reference, via the 'toc' attribute of the spine
element, to a table-of-contents file that will usually be in NCX
format)
Normally, the EPUB book will consist of series of XHTML files, each
file containing one 'chapter' of the book. The basic procedure to
display a book for reading would be:
figure out which chapter the user wants to look at
load the XHTML for that chapter into a WebView (or some other solution for rendering XHTML on screen)
Problems you are likely to encounter:
Many EPUB books are created using ZIP-generators that, although
compatible with the ZIP standard, are incompatible with the
ZIP-reader APIs built into the OS. You will probably need to use a
third party library like DotNetZip or SharpZipLib (but be careful of
the licence conditions for the latter).
You will need to do some work to display images in the WebView,
especially if you try to cover all the image types that are part
of the EPUB standard.
It will be fiddly to find and apply all the CSS styles that the EPUB
book defines.
You will probably want to display a 'paged' view of the chapter,
rather than displaying it as a long vertically scrollable column.
That will involve some funky javascript work.
You may find that an individual EPUB chapter is too large for
displaying in a WebView. In the end, you may decide that all the
limitations of WebView mean you will be better off writing your own
custom XHTML-parsing rendering solution, and displaying using
TextBlocks, or something more exotic (you can use C++ interop
code and the D2D Font APIs)

To parse .epub file you may want to use a library :
Epubreader
Epubsharp
DotNetEpub
Aspose.Words for .NET
source on SO : 1 2 3 4

Reading Data From PDF File And Write It Into Word File?

How Can We Read Data From PDF File And Write It In Word File Using Asp.net C# Code...?

You can use the IFilter capabilities built into Windows, here's an article with some example code:
Using-IFilter-in-C
The issue with PDF files is that even if you're able to extract the plaintext of the PDF in readable form (which is not a guarantee by any stretch), the text will be completely unformatted. Even simple things like line breaks will be lost in many cases.

How to create a word document using html written in C#

I creating a C# application that has to create a word document.
I'm using the Microsoft.Office.Interop.Word to do this and I've successfully managed to output some word documents, but creating the content trough the code is a very time consuming work.
I noted that word is able to open html pages and show it as a normal content so I created a simple test table in html and inserted it into the word document. But when I outputted the document the obvious happened: The tags where still there! Word did not format the tags as html. It just outputted exactly what I put in there.
How can I tell word to reformat the text as html?
edit: (trough the C# code of course)
edit 2: Please note that I'm parsing trough some data to make this, so I will end up with about 4 pages of the same table/html, so I will need to be able to tell word to start at the next page each time I've finished a loop. So a html-only method will probably not work.

If you're only wanting to output simple HTML content as a Word document, you could always cheat and write out the HTML content with a .doc extension.
Word will open that just fine.
If you need to add a page break, you can use a CSS page-break-before, like so:
<br style="page-break-before: always;"/>
If you're set on using Interop, having read up a little bit, this post states that you need a converter to insert HTML, and the converters are only accessible when:
you paste HTML from the Clipboard
open/insert HTML from a file
So, this answer looks like it provides a clipboard-based solution : Adding html text to Word using Interop
However, if there's any money to spend on the project, I can heartily recommend Aspose.Words which will do all of this for you.

As requested by the OP, and to make easier for others to find this solution, here it goes the answer I posted as a comment (plus extra results from testing):
When opening an HTML file, MS Word honors the CSS properties page-break-before and page-break-after. There is a caveat, however:
On "Web design" view, page-breaks are never shown (this doesn't mean that they aren't there), just like browsers don't "show" them. And Word opens html files on Web design view by default (which quite makes sense). You need to print the document or switch to some other view (typicall "Print design") to see your breaks in all their glory.
So, saving an HTML file with a .doc extension is a viable solution (also tested: Word opens it properly despite of the extension).
Note: all the testing was done on MS Word 2003 using this snippet: <html>asdf<br style="page-break-before: always;">new page!</html>

Don't build the document in code, create it in Word as template or mail merge template and the use code to merge or replace the fields data.
See this answer here
MS Word Office Automation - Filling Text Form Fields And Check Box Form Fields And Mail Merge
And See this from the mothership:
http://msdn.microsoft.com/en-us/library/ff433638.aspx

If you don't want to use an external lib, Interop is too slow for you and neither pure HTML nor mail merge template are flexible enough, you could write your content as text or HTML into one or more files (using C#), create a VBA macro in a Word document which by itself creates a second Word document, reads the content files and does any formatting you want afterwards.
You can run this macro programmatically by starting Word using the command line switch /m.

Another possible approach, if your html is xhtml (i.e. XML compliant), you could use XSLT to convert it to a Word XML format. But this would take a LOOOOOOOOOOONG time to code.
If you don't have to use HTML as the starting point you could simply build the Word XML document yourself rather than using XSLT, which would be easier. Time consuming but possible - it's something I do quite a lot in my work.

If a third party component is an option I would recommend the stuff from Aspose.
I have been pretty happy with their tools so far. The API is a little messy but everything works as one would expect.

Word Automation Multiple Paste Problem

Is there a better way to paste HTML fragments into a Word document than via the clipboard from C#?
using Word = Microsoft.Office.Interop.Word;
I'm using some code that puts HTML into the clipboard:
HtmlFragment.CopyToClipboard(changedText);
I have a selection in word (from a formfield) and I do:
word.Selection.Paste();
But sometimes it just throws a COM exception. If I add
Thread.Sleep(100);
I can get it to work, but that's not ideal.
The Insert methods look like a better option but there is no Insert from HTML.
So what's the best way to insert lots of HTML fragments into Word quickly using the automation interfaces?
Edit
Some good advice in the responses but the issue turned out to be a simple <br> tag causing word to fail on paste.

For interop, instead of Selection.Paste you'll want to use Selection.PasteSpecial with a WdPasteDataType of wdPasteHTML.
If you're using the new formats of Word (i.e. 2007/2010), you could give up interop all together and just go with WordprocessingML (using the Open XML SDK or just free-hand it with Linq and System.IO.Packaging). Or you could just it in conjunction with Interop if that was a need.
If you're using Open XML, you could just use altChunk to import HTML. Here's an example (which includes an example for HTML) at How to Use altChunk for Document Assembly. And another (fresh off the presses - it was released today): Importing HTML that contains Numbering using altChunk.

+1 to Otaku's comments, though, generally speaking, i've found it best to use the various RANGE.* functions for pasting in data than the Selection object, or pasting through the clipboard. the main reason is, if you paste through the clipboard, you scramble whatever was on the clipboard (which might not be what the user wants to happen).
the Selection object applies across all open word documents, which can get you in trouble in some cases. Unfortunately there are a few things that you just about can't do any other way.
And, there are some things (Like altering text at the current cursor position) that you MUST use the selection object for.

+1 to DarinH comments. Also something to note is that you can paste on any place in the document using Range without having to change the selection of the document (the cursor in the document).
Sometimes PasteAndFormat throws an Exception on freshly created Documents, check my reply here if that happens: https://stackoverflow.com/a/65796482/15001063

Is there a way to generate word documents dynamically without having word on the machine

I am planning on generating a Word document on the webserver dynamically. Is there good way of doing this in c#? I know I could script Word to do this but I would prefer another option.

I've worked at a company in the past that really wanted generated word documents, in the end they were perfectly satisfied with RTF docs that had a ".doc" extension. Word has no problem recognizing and opening them.
The RTF docs were generated with iText.net (free .net library), the API is pretty easy to use, performs extremely well, you don't need word on the machine, also, you could extend to generating PDF, HTML, and Text docs in the future with very little effort. After four years the solution I created is still in place, so that's a little testimony in iText.net's favor.
It looks like the official iText page suggests that iText Sharp is the best .Net choice right now, so that's another option

You'd be better off generating an rtf file, which word will know how to open.

If want to generate Office 2007 documents check the Open XML File Formats, they're simple zipped XML files, check this links:
Open XML File Formats: What is it, and how can I get started?
Introducing the Office (2007) Open XML File Formats
Edit: Check this project, can serve you as a good starting point:
DocumentMaker
Seems very simple and customizable, look this code snippet:
Paragraph p = new Paragraph();
p.Runs.Add(new Run("Text can have multiple format styles, they can be "));
p.Runs.Add(new Run("bold and italic",
TextFormats.Format.Bold | TextFormats.Format.Italic));
doc.Paragraphs.Add(p);

Word will quite happily open a HTML with a .doc extension. If you include an internal style sheet, you can have it fully formatted. There was previous post on this subject:
Export to Word Document in C#

Creating the old .DOC files (pre-Word 2007) is nigh-impossible without Word itself. The format is just too complex. Microsoft has released the format description, but it's enough to reduce a grown programmer to tears. There is a reason for that too (historical), but that doesn't make things better.
The new .DOCX would be easier, although quite a bit of hassle still. However depending on which Word versions you are targeting, there are some other options too.
For one, there is the classic .RTF. The format is pretty complex still, yet well documented and has strong support across many applications and platforms. And you might use some string-replacing into template files to make things easier (it's non-binary).
Then there are the "old" Word XML files. I think they worked starting with Word XP. Kinda the predecessors of .DOCX. I've used them, not bad. And the documentation is pretty OK.
Finally, the easy way that I would choose, is to make a simple HTML. Word can load HTML files just fine starting with version 2000. In the simplest way just change the extension of a HTML file to .DOC and you have it. You can also add a few word-specific tags and comments to make it look even better in Word. Use the Word's Save As...HTML option to see what they are.

There are third party libraries about that will do the job.
Doing a quick google came up with this one, for example.
I haven't tried any, so I can't give you specific advice, I'm afraid!
Let us know how you get on...

In Office 2007 Microsoft introduced a new file format called the Microsoft Open Office XML Format (.docx). This format is not compatible with older versions of Microsoft Word. Since this is XML you can create or read with out having a Word installed.

Here is the component that generates document based on the custom template. The documents are generated from the sharepoint list ... so the data is pulled from the list item into the document on the fly:
http://store.sharemuch.com/products/generate-word-documents-from-sharepoint-list
Hope that helps,
Yaroslav Pentsarskyy
Blog: www.sharemuch.com

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How to extract text from Word files using C#? - c#

A word file is just a normal file as far as your code goes. Try this: using System.IO; StreamReader streamReader = new StreamReader(filePath); string text = streamReader.ReadToEnd(); streamReader.Close();

Related

How to parse an ncx file using c#

Reading Data From PDF File And Write It Into Word File?

How to create a word document using html written in C#

Word Automation Multiple Paste Problem

Is there a way to generate word documents dynamically without having word on the machine

Categories

Resources