Deleting text from a Word Interop document - C#

I am having trouble trying to remove a list of data/text from a Word document using Word Interop. So far I thought that I could read through the document to find the starting text, then find the ending text, and save each of those indexes to its own variable. Next I would just loop through the data from the starting index to the ending index and delete all the text in between.
The problem is that it works incorrectly and doesn't provide the expected results. I must not be understanding how the Range interface works in document.Paragraphs[i+1].Range.Delete();. It deletes some lines but not all, and seems to go beyond the paragraphs that I want to delete. What am I missing? There must be a better way to do this. Documentation for Interop seems sparse.
string text = " ";
int StartLocation = 0;
int EndLocation = 0;
// I roughly know the starting location,
// so I start at i = 2248 to avoid
// looping through the entire document
for (int i = 2248; i < 2700; i++)
{
text = document.Paragraphs[i + 1].Range.Text.ToString();
if (text.Contains("firstWordImSearchingFor"))
{
StartLocation = i;
}
if (text.Contains("lastWordImSearchingFor"))
{
EndLocation = i;
}
}
//delete everything between those paragraph locations
//(not working correctly/ skips lines)
for (int i = StartLocation; i < EndLocation - 1; i++)
{
document.Paragraphs[i+1].Range.Delete();
}

The drawback to the approach you're trying is that the Start and End locations (number of characters from the beginning of the Document story) will vary depending on what non-visible / non-printing characters are present. Content Controls, field codes and other things affect this - all in different ways depending on how things are being queried.
More reliable would be to store the starting point in one Range then extend it to the end point.
I also recommend using Range.Find to search for the start and end points.
Bare-bones pseudo-code example, since I don't really have enough information to go on to give you full, working code:
Word.Range rngToDelete = null;
Word.Range rngFind = document.Content;
bool wasFound = false;
object missing = System.Type.Missing;
object oEnd = Word.WdCollapseDirection.wdCollapseEnd;
wasFound = rngFind.Find.Execute("firstWordImSearchingFor", ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing, ref missing, ref missing);
if (wasFound)
{
rngToDelete = rngFind.Duplicate; //rngFind is now where the term was found!
//reset the range to Find so it moves forward
rngFind.Collapse(ref oEnd);
rngFind.End = document.Content.End;
wasFound = rngFind.Find.Execute("lastWordImSearchingFor", ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing, ref missing, ref missing);
if (wasFound)
{
rngToDelete.End = rngFind.End;
rngToDelete.Delete();
}
}
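A likely contributor to the skipped lines in the question's loop is index shifting: once paragraph i is deleted, the paragraph that followed it becomes paragraph i, so incrementing i steps over it. The sketch below reproduces that effect with a plain List<string> standing in for document.Paragraphs. The names DeleteDemo, DeleteForward and DeleteBackward are hypothetical and no Word objects are involved, so treat it as an illustration of the indexing problem rather than of Interop itself.

```csharp
using System;
using System.Collections.Generic;

static class DeleteDemo
{
    // Forward deletion: after RemoveAt(i) the next item slides into slot i,
    // and the loop's i++ then skips it.
    public static List<string> DeleteForward(List<string> paras, int start, int end)
    {
        var copy = new List<string>(paras);
        for (int i = start; i <= end && i < copy.Count; i++)
            copy.RemoveAt(i);
        return copy;
    }

    // Backward deletion: items before index i never move, so nothing is skipped.
    public static List<string> DeleteBackward(List<string> paras, int start, int end)
    {
        var copy = new List<string>(paras);
        for (int i = end; i >= start; i--)
            copy.RemoveAt(i);
        return copy;
    }

    static void Main()
    {
        var paras = new List<string> { "keep0", "del1", "del2", "del3", "del4", "keep5", "keep6" };
        Console.WriteLine(string.Join(",", DeleteForward(paras, 1, 4)));  // keep0,del2,del4,keep6
        Console.WriteLine(string.Join(",", DeleteBackward(paras, 1, 4))); // keep0,keep5,keep6
    }
}
```

The forward loop both leaves some targeted items behind and removes one that should have been kept, which matches the symptoms described in the question. Deleting a single Range that spans the whole block, as in the code above, avoids the issue altogether.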

This is completely untested and is offered as a suggestion:
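bool inDelete = false;
// Collect the ranges first; deleting while enumerating would shift the collection.
var toDelete = new System.Collections.Generic.List<Word.Range>();
foreach (Word.Paragraph para in document.Paragraphs)
{
string text = para.Range.Text;
if (text.Contains("Start flag"))
{
inDelete = true;
}
if (inDelete)
{
toDelete.Add(para.Range);
}
if (inDelete && text.Contains("End flag"))
{
break;
}
}
// delete from the last collected paragraph back to the first
for (int i = toDelete.Count - 1; i >= 0; i--)
{
toDelete[i].Delete();
}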

Related

How to reverse paragraphs with Interop.Word

I am trying to reverse document paragraphs with the following code:
using Word = Microsoft.Office.Interop.Word;
object filePath = @"input.docx";
Word.Application app = new();
app.Visible = false;
object missing = System.Type.Missing;
object readOnly = false;
object isVisible = false;
Word.Document doc = app.Documents.Open(
ref filePath,
ref missing, ref readOnly, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref isVisible, ref missing,
ref missing, ref missing, ref missing);
try
{
Word.Range cachedPara2 = doc.Paragraphs[2].Range.Duplicate;
doc.Paragraphs[2].Range.FormattedText = doc.Paragraphs[1].Range.FormattedText;
doc.Paragraphs[1].Range.FormattedText = cachedPara2.FormattedText;
doc.SaveAs(@"output.docx");
}
finally
{
doc.Close();
app.Quit();
}
I expect this:
but the actual result is this:
How can I get the expected result?
UPDATE
With the answer below, I was able to get the expected result for my first case.
Now, in another case, I want to do the following:
Unfortunately, I couldn't quite figure out how the .Collapse() method works. I am trying to do it with .InsertParagraphAfter():
doc.Paragraphs[2].Range.InsertParagraphAfter();
doc.Paragraphs[3].Range.FormattedText = doc.Paragraphs[5].Range.FormattedText;
doc.Paragraphs[5].Range.FormattedText = doc.Paragraphs[2].Range.FormattedText;
doc.Paragraphs[2].Range.Delete();
Where does this empty paragraph come from? How do I avoid it?
A range object does not have any content itself, it merely points to the location of the content, rather like a set of map co-ordinates.
What you need to do is add the content of the second paragraph before the first, which will create a new first paragraph. You can then delete what is now the third paragraph. For example:
Word.Range target = doc.Paragraphs[1].Range;
object collapseStart = Word.WdCollapseDirection.wdCollapseStart;
target.Collapse(ref collapseStart);
target.FormattedText = doc.Paragraphs[2].Range.FormattedText;
doc.Paragraphs[3].Range.Delete();
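To make the "map co-ordinates" point concrete, here is a minimal sketch in plain C# with no Word objects. SketchRange and RangeDemo are hypothetical names: SketchRange is a stand-in that stores only a start and end position over a StringBuilder "document" and looks its text up on every read. A real Word.Range also adjusts its positions as the document is edited, which this sketch does not model.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for a range: it stores positions, never text.
class SketchRange
{
    private readonly StringBuilder _doc;
    public int Start { get; }
    public int End { get; }

    public SketchRange(StringBuilder doc, int start, int end)
    {
        _doc = doc;
        Start = start;
        End = end;
    }

    // The text is fetched from the document each time it is read.
    public string Text => _doc.ToString(Start, End - Start);
}

static class RangeDemo
{
    public static (string Before, string After) Run()
    {
        var doc = new StringBuilder("AAAA\rBBBB\r");
        var cached = new SketchRange(doc, 5, 9);   // "caches" paragraph 2
        string before = cached.Text;               // BBBB

        // Overwrite paragraph 2 with paragraph 1's text, as the question's code does.
        doc.Remove(5, 4).Insert(5, "AAAA");

        string after = cached.Text;                // AAAA: the "cache" followed the document
        return (before, after);
    }

    static void Main()
    {
        var result = Run();
        Console.WriteLine(result.Before + " -> " + result.After); // BBBB -> AAAA
    }
}
```

This is why duplicating a Range before overwriting the paragraph does not preserve the old content: the duplicate still points at the same location, which now holds the new text.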

Issue with getting bulleted list from docx

I am trying to read a bulleted list from a docx file using C#. However, instead of the bullet number or symbol, the output contains a weird symbol or a rectangle with a question mark inside. Following is the code snippet I am using to get the bullet value, but it is not working:
Application word = new Application ();
Document doc = new Document ();
object fileName = @"D:\testing\Sample_2.docx";
// Define an object to pass to the API for missing parameters
object missing = System.Type.Missing;
doc = word.Documents.Open (ref fileName,
ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing);
for (int i = 0; i < doc.Paragraphs.Count; i++) {
string bullet = doc.Paragraphs[i + 1].Range.ListFormat.ListString;
}
((_Document) doc).Close ();
((_Application) word).Quit ();

How to link document to different structured template

I am trying to automate a process for changing the document templates of Word files.
If the templates have a similar structure, i.e. they both use heading1, then linking the document to the new template works.
However, the new template's structure is completely different: heading1 is no longer used, it is now section1. How can I change these section titles with code? Something along the lines of if(heading1) rename to section1;
I am using Interop.Word to perform these operations.
Below is the code I'm using:
public string UpdateDocumentWithNewTemplate(string document, string theme, string template, Word.Application wordApp)
{
try
{
object missing = System.Reflection.Missing.Value;
Word.Document aDoc = null;
object notReadOnly = false;
object isVisible = false;
wordApp.Visible = false;
// create objects from variables for wordApp
object documentObject = document;
// open existing document
aDoc = wordApp.Documents.Open(ref documentObject, ref missing, ref notReadOnly, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref isVisible,
ref missing, ref missing, ref missing, ref missing);
aDoc.Activate();
// set template and theme to overwrite the existing styles
aDoc.CopyStylesFromTemplate(template);
aDoc.ApplyDocumentTheme(theme);
aDoc.UpdateStyles();
// save the file with the changes
aDoc.SaveAs(ref documentObject, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing, ref missing, ref missing);
// close the document
aDoc.Close(ref missing, ref missing, ref missing);
if (aDoc != null)
System.Runtime.InteropServices.Marshal.ReleaseComObject(aDoc);
aDoc = null;
return documentObject.ToString();
}
catch (Exception exception)
{
return "Error: " + exception;
}
}
For the specific example you need to first import the styles from the other template, then do a Find/Replace to replace the styles applied. I see from your code that you've got the first part (aDoc.CopyStylesFromTemplate(template); aDoc.ApplyDocumentTheme(theme); aDoc.UpdateStyles();).
What many don't realize about Word's Find/Replace functionality is that it can also work with formatting. The best way to get the necessary syntax is to record a successful Find/Replace in a macro, then port the VBA to C#. In the UI:
Ctrl+H to open the Replace dialog box
With the cursor in the "Find what" box, click "More" then "Format" and choose "Style"
Select the name of the style you want to find and have replaced
Click in the "Replace with" box
Use Format/Style, again, to choose the style you want to use
Click "Replace All".
Here's the result I get:
Selection.Find.ClearFormatting
Selection.Find.Style = ActiveDocument.styles("Heading 1")
Selection.Find.Replacement.ClearFormatting
Selection.Find.Replacement.Style = ActiveDocument.styles("section2")
With Selection.Find
.Text = ""
.Replacement.Text = ""
.Forward = True
.wrap = wdFindContinue
.Format = True
.MatchCase = False
.MatchWholeWord = False
.MatchByte = False
.CorrectHangulEndings = False
.HanjaPhoneticHangul = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
Selection.Find.Execute Replace:=wdReplaceAll
You should use Range, not Selection. So the C# code would look something like the following code block. Note how
I get the Range of the entire document
Create a Find object for the Range and use that
To reference Styles for the Find; I show two possibilities
You can set almost all of the properties on Find before calling Find.Execute. It would also be possible to create an object variable for each setting (a single one each for true and false would be enough), then pass these "by ref" to Find.Execute. As far as I know, this is simply a matter of personal preference. I did it this way to give the most literal "translation" of the VBA into C# code.
In any case, Find.Execute "remembers" these settings, so ref missing can then be used for all the parameters you don't set specifically. In this case, only the "replace all" command is used specifically in the method.
Word.Document doc = wdApp.ActiveDocument;
Word.Range rngFind = doc.Content;
Word.Find fd = rngFind.Find;
fd.ClearFormatting();
Word.Style stylFind = doc.Styles["Heading 1"];
fd.set_Style(stylFind);
fd.Replacement.ClearFormatting();
fd.Replacement.set_Style(doc.Styles["section2"]);
fd.Text = "";
fd.Replacement.Text = "";
fd.Forward = true;
fd.Wrap = Word.WdFindWrap.wdFindStop;
fd.Format = true;
fd.MatchCase = false;
fd.MatchWholeWord = false;
fd.MatchByte = false;
fd.CorrectHangulEndings = false;
fd.HanjaPhoneticHangul = false;
fd.MatchWildcards = false;
fd.MatchSoundsLike = false;
fd.MatchAllWordForms = false;
object replaceAll = Word.WdReplace.wdReplaceAll;
object missing = Type.Missing;
fd.Execute(ref missing, ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing, ref missing,
ref replaceAll, ref missing, ref missing, ref missing, ref missing);

Read a Word Document Using c#

I need to start reading a word document from a specific point.
That key word is taken from a dropdown combo box.
The keyword is something like [blah blah, blah, 001]
So, I need to read only the content from that keyword to next heading ...
I used this to read the heading numbers and the text line by line,
but reading the heading number is not working:
string headNum = objparagraph.Range.ListFormat.ListString;
string sLine = objparagraph.Range.Text;
Word.Application word = new Word.Application();
Word.Document doc = new Word.Document();
object fileName = @"C:\wordFile.docx";
// Define an object to pass to the API for missing parameters
object missing = System.Type.Missing;
doc = word.Documents.Open(ref fileName,
ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing);
string ReadValue = string.Empty;
// Activate the document
doc.Activate();
foreach (Word.Range tmpRange in doc.StoryRanges)
{
ReadValue += tmpRange.Text;
}
If I understood correctly, you need to read the Word document starting from your keyword to next heading. In other words, something like the red text in the following document:
In that case, here is how you can accomplish that with GemBox.Document:
string keyword = " [blah blah, blah, 001]";
DocumentModel document = DocumentModel.Load("input.docx");
ContentPosition start = document.Content
.Find(keyword)
.First()
.End;
ContentPosition end = new ContentRange(start, document.Content.End)
.GetChildElements(ElementType.Paragraph)
.Cast<Paragraph>()
.First(p => p.ParagraphFormat.Style != null && p.ParagraphFormat.Style.Name.Contains("heading"))
.Content
.Start;
string text = new ContentRange(start, end).ToString();
The text variable's value will be:
Sample text content that we want to retrieve.
Another sample paragrap.
Also, here are additional Reading and Get Content examples, they contain some useful information.
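Independent of the library used, the underlying logic is "start after the keyword, stop at the next heading". A minimal sketch of that logic in plain C# follows, assuming each paragraph has already been reduced to a (style name, text) pair and that heading styles have names starting with "Heading". SectionReader and ReadAfterKeyword are hypothetical names, and reading the style name and text from an actual document is left out. Unlike the GemBox example above, this sketch skips the rest of the paragraph that contains the keyword.

```csharp
using System;
using System.Collections.Generic;

static class SectionReader
{
    // Returns the text of the paragraphs that follow the paragraph containing
    // the keyword, stopping at the next paragraph styled as a heading.
    public static List<string> ReadAfterKeyword(
        IEnumerable<(string Style, string Text)> paragraphs, string keyword)
    {
        var result = new List<string>();
        bool reading = false;

        foreach (var (style, text) in paragraphs)
        {
            if (!reading)
            {
                // Keep scanning until the keyword paragraph is found.
                if (text.Contains(keyword)) reading = true;
                continue;
            }

            // Assumption: heading styles are named "Heading 1", "Heading 2", ...
            if (style.StartsWith("Heading", StringComparison.OrdinalIgnoreCase)) break;

            result.Add(text);
        }
        return result;
    }

    static void Main()
    {
        var paragraphs = new List<(string Style, string Text)>
        {
            ("Heading 1", "Intro"),
            ("Normal", "Before the keyword"),
            ("Normal", "[blah blah, blah, 001]"),
            ("Normal", "Sample text content that we want to retrieve."),
            ("Normal", "Another sample paragraph."),
            ("Heading 1", "Next heading"),
            ("Normal", "Not wanted"),
        };
        foreach (string line in ReadAfterKeyword(paragraphs, "[blah blah, blah, 001]"))
            Console.WriteLine(line);
    }
}
```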

Need suggestions on how to extract data from .docx/.doc file then into SQL Server

I'm supposed to develop an application for my project. It will load a past-year examination / exercise paper (Word file), detect the sections accordingly, extract the questions and images in each section, and then store the questions and images in the database. (A preview of the question paper is at the bottom of this post.)
So I need some suggestions on how to extract data from a Word file and then insert it into a database. Currently I have a few methods to do so, however I have no idea how I could implement them when the file contains textboxes with a background image. Each question has to be linked with its image.
Method One (Make use of MS Office Interop)
Load the Word file -> Extract image, save into a folder -> Extract text, save as .txt -> Extract text from .txt then store in db
Questions:
How do I detect the section and question?
How do I link the image to the question?
Extract text from word file (Working):
private object missing = Type.Missing;
private object sFilename = @"C:\temp\questionpaper.docx";
private object sFilename2 = @"C:\temp\temp.txt";
private object readOnly = true;
object fileFormat = Word.WdSaveFormat.wdFormatText;
private void button1_Click(object sender, EventArgs e)
{
Word.Application wWordApp = new Word.Application();
wWordApp.DisplayAlerts = Word.WdAlertLevel.wdAlertsNone;
Word.Document dFile = wWordApp.Documents.Open(ref sFilename,
ref missing, ref readOnly, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing);
dFile.SaveAs(ref sFilename2, ref fileFormat, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing,ref missing,
ref missing,ref missing,ref missing,ref missing,ref missing,
ref missing,ref missing);
dFile.Close(ref missing, ref missing, ref missing);
}
Extract image from word file (doesn't work on image inside textbox):
private Word.Application wWordApp;
private int m_i;
private object missing = Type.Missing;
private object filename = @"C:\temp\questionpaper.docx";
private object readOnly = true;
private void CopyFromClipbordInlineShape(String imageIndex)
{
Word.InlineShape inlineShape = wWordApp.ActiveDocument.InlineShapes[m_i];
inlineShape.Select();
wWordApp.Selection.Copy();
Computer computer = new Computer();
if (computer.Clipboard.GetDataObject() != null)
{
System.Windows.Forms.IDataObject data = computer.Clipboard.GetDataObject();
if (data.GetDataPresent(System.Windows.Forms.DataFormats.Bitmap))
{
Image image = (Image)data.GetData(System.Windows.Forms.DataFormats.Bitmap, true);
image.Save("C:\\temp\\DoCremoveImage" + imageIndex + ".png", System.Drawing.Imaging.ImageFormat.Png);
}
}
}
private void button1_Click(object sender, EventArgs e)
{
wWordApp = new Word.Application();
wWordApp.Documents.Open(ref filename,
ref missing, ref readOnly, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing);
try
{
for (int i = 1; i <= wWordApp.ActiveDocument.InlineShapes.Count; i++)
{
m_i = i;
CopyFromClipbordInlineShape(Convert.ToString(i));
}
}
finally
{
object save = false;
wWordApp.Quit(ref save, ref missing, ref missing);
wWordApp = null;
}
}
Method Two
Unzip the word file (.docx) -> Copy the media(image) folder, store somewhere -> Parse the XML file -> Store the text in db
Any suggestion/help would be greatly appreciated :D
Preview of the word file:
(backup link: http://i.stack.imgur.com/YF1Ap.png)
The answer is choice #3 - the OpenXML SDK. First let me explain why you don't want the choices listed above.
Running Office on the server is a bad idea. Microsoft specifically says don't do it. It's slow and you will hit "issues" where it throws exceptions or just fails to find things.
Parsing the XML file will work, but the XPath needed to find every possible case where the images, etc. are located adds up. You would probably have to iterate over the section properties, which come at the end of each section, then handle all the cases: in a cell, in a textbox, positioned, inline, etc.
If you go with the OpenXML SDK you have a LINQ interface where you can use Descendants to get everything that is an image (or whatever you need). It also gives you sections via the SectPr node, so you can easily iterate over sections.
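As a rough, untested sketch of what that could look like, the code below walks the paragraphs of the main document part, prints each paragraph's text, and resolves any embedded picture in that paragraph to its image part. It assumes the DocumentFormat.OpenXml NuGet package is referenced and reuses the file path from the question as a placeholder. DocxSketch is a hypothetical name, and how questions and sections are actually recognised in this particular paper is not covered.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using A = DocumentFormat.OpenXml.Drawing;

static class DocxSketch
{
    static void Main()
    {
        // Placeholder path taken from the question; opened read-only.
        using (WordprocessingDocument doc =
               WordprocessingDocument.Open(@"C:\temp\questionpaper.docx", false))
        {
            MainDocumentPart main = doc.MainDocumentPart;
            Body body = main.Document.Body;

            // Each sectPr element marks the end of a section.
            Console.WriteLine("Section markers: " + body.Descendants<SectionProperties>().Count());

            foreach (Paragraph para in body.Descendants<Paragraph>())
            {
                Console.WriteLine(para.InnerText);

                // A blip references a picture by relationship id.
                foreach (A.Blip blip in para.Descendants<A.Blip>())
                {
                    string relId = blip.Embed?.Value;
                    if (relId == null) continue; // linked rather than embedded picture

                    if (main.GetPartById(relId) is ImagePart image)
                        Console.WriteLine("  image: " + image.Uri + " (" + image.ContentType + ")");
                }
            }
        }
    }
}
```

Because the picture is found from inside the paragraph that contains it, the text and the image stay associated, which is the link the question asks about. Paragraphs nested inside textboxes are also returned by Descendants, so the same text may be printed more than once; a real implementation would need to decide how to treat those.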
