Read a Word Document Using C#

I need to start reading a Word document from a specific point.
That point is a keyword taken from a dropdown combo box.
The keyword is something like [blah blah, blah, 001].
So I need to read only the content from that keyword to the next heading.
I used the following to read the heading numbers and the text line by line,
but the heading number is not working:
string headNum = objparagraph.Range.ListFormat.ListString;
string sLine = objparagraph.Range.Text;

Word.Application word = new Word.Application();
Word.Document doc = new Word.Document();
object fileName = @"C:\wordFile.docx";
// Define an object to pass to the API for missing parameters
object missing = System.Type.Missing;
doc = word.Documents.Open(ref fileName,
    ref missing, ref missing, ref missing, ref missing,
    ref missing, ref missing, ref missing, ref missing,
    ref missing, ref missing, ref missing, ref missing,
    ref missing, ref missing, ref missing);
string ReadValue = string.Empty;
// Activate the document
doc.Activate();
foreach (Word.Range tmpRange in doc.StoryRanges)
{
    ReadValue += tmpRange.Text;
}

If I understood correctly, you need to read the Word document starting from your keyword up to the next heading. In other words, something like the red text in the following document:
In that case, here is how you can accomplish that with GemBox.Document:
string keyword = " [blah blah, blah, 001]";
DocumentModel document = DocumentModel.Load("input.docx");
ContentPosition start = document.Content
    .Find(keyword)
    .First()
    .End;
ContentPosition end = new ContentRange(start, document.Content.End)
    .GetChildElements(ElementType.Paragraph)
    .Cast<Paragraph>()
    .First(p => p.ParagraphFormat.Style != null && p.ParagraphFormat.Style.Name.Contains("heading"))
    .Content
    .Start;
string text = new ContentRange(start, end).ToString();
The text variable's value will be:
Sample text content that we want to retrieve.
Another sample paragraph.
Also, see the additional Reading and Get Content examples; they contain some useful information.

Related

How to reverse paragraphs with Interop.Word

I am trying to reverse document paragraphs with the following code:
using Word = Microsoft.Office.Interop.Word;
object filePath = @"input.docx";
Word.Application app = new();
app.Visible = false;
object missing = System.Type.Missing;
object readOnly = false;
object isVisible = false;
Word.Document doc = app.Documents.Open(
    ref filePath,
    ref missing, ref readOnly, ref missing, ref missing,
    ref missing, ref missing, ref missing, ref missing,
    ref missing, ref missing, ref isVisible, ref missing,
    ref missing, ref missing, ref missing);
try
{
    Word.Range cachedPara2 = doc.Paragraphs[2].Range.Duplicate;
    doc.Paragraphs[2].Range.FormattedText = doc.Paragraphs[1].Range.FormattedText;
    doc.Paragraphs[1].Range.FormattedText = cachedPara2.FormattedText;
    doc.SaveAs(@"output.docx");
}
finally
{
    doc.Close();
    app.Quit();
}
I expect this:
but the actual result is this:
How can I get the expected result?
UPDATE
With the answer below, I was able to get the expected result for my first case.
Now, in another case, I want to do the following:
Unfortunately, I couldn't quite figure out how the .Collapse() method works. I am trying to do it with .InsertParagraphAfter():
doc.Paragraphs[2].Range.InsertParagraphAfter();
doc.Paragraphs[3].Range.FormattedText = doc.Paragraphs[5].Range.FormattedText;
doc.Paragraphs[5].Range.FormattedText = doc.Paragraphs[2].Range.FormattedText;
doc.Paragraphs[2].Range.Delete();
Where does this empty paragraph come from? How can I avoid it?

A range object does not have any content itself, it merely points to the location of the content, rather like a set of map co-ordinates.
What you need to do is add the content of the second paragraph before the first, which will create a new first paragraph. You can then delete what is now the third paragraph. For example:
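Word.Range target = doc.Paragraphs[1].Range;
// Collapse the range to its start so the new content is inserted before paragraph 1
object collapseStart = Word.WdCollapseDirection.wdCollapseStart;
target.Collapse(ref collapseStart);
target.FormattedText = doc.Paragraphs[2].Range.FormattedText;
// The original second paragraph is now the third one
doc.Paragraphs[3].Range.Delete();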

How to compare image (Shape) present in each page of word document through Microsoft.Interop.Word using C#.Net?

I am using the following code to replace an image (a Shape in Microsoft.Office.Interop.Word) in a Word document with a new image. However, the client's requirement is that I check the first image on the first page of the document and compare it with the images in the rest of the document. If an image matches, it is replaced with the new image; otherwise it is left alone. I need help with how to compare two shapes (images).
public void ReplaceWordImage(string FilePath)
{
    Word.Document d = new Word.Document();
    Word.Application WordApp;
    WordApp = new Microsoft.Office.Interop.Word.Application();
    bool headerImage = false;
    try
    {
        object missing = System.Reflection.Missing.Value;
        object yes = true;
        object no = false;
        object filename = @"D:/ImageToReplace/5.docx";
        d = WordApp.Documents.Open(ref filename, ref missing, ref no, ref missing,
            ref missing, ref missing, ref missing, ref missing, ref missing,
            ref missing, ref missing, ref yes, ref missing, ref missing, ref missing, ref missing);
        List<Word.ShapeRange> ranges = new List<Microsoft.Office.Interop.Word.ShapeRange>();
        List<Word.ShapeRange> headerRanges = new List<Microsoft.Office.Interop.Word.ShapeRange>();
        foreach (Word.Shape shape in d.Shapes)
        {
            if (shape.Type == Microsoft.Office.Core.MsoShapeType.msoPicture)
            {
                shape.Delete();
                foreach (Word.Range r in ranges)
                {
                    r.InlineShapes.AddPicture(@"D:\Untitled.jpg", ref missing, ref missing);
                    break;
                }
            }

The Word object model doesn't provide anything for comparing two images. The best you could do with it is to save both images to disk and then compare their byte representations. However, there is a better way to get the job done: the Open XML SDK, which lets you get the byte representation of images on the fly without saving them to disk first. The Open XML SDK contains a WordprocessingDocument class that can work on a memory stream containing the Word document content, and a MemoryStream can be converted to a byte[] using ToArray(). See "Convert Word of interop object to byte [] without saving physically" for more information.
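Once both images are available as byte[] (however they were obtained), the comparison itself could look like the minimal sketch below. The class and method names (ImageBytes, SameBytes, Fingerprint) are hypothetical and not part of Word Interop or the Open XML SDK. Note the assumption: this only detects images whose stored bytes are identical, not images that merely look the same after re-encoding or resizing.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper, not part of any Office library.
public static class ImageBytes
{
    // True when both buffers hold exactly the same bytes in the same order.
    public static bool SameBytes(byte[] a, byte[] b)
    {
        if (a == null || b == null) return false;
        return a.Length == b.Length && a.SequenceEqual(b);
    }

    // SHA-256 hash as a hex string; handy for comparing one reference
    // image against many others without keeping every buffer around.
    public static string Fingerprint(byte[] data)
    {
        using (var sha = SHA256.Create())
        {
            return BitConverter.ToString(sha.ComputeHash(data));
        }
    }
}
```

A possible use is to compute Fingerprint once for the first image, then compare it with the fingerprint of each later image and replace only those that match.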

How to link document to different structured template

I am trying to automate a process for changing the document templates of Word files.
If the templates have a similar structure, i.e. they both use heading1, then linking the document to the new template works.
However, the new template's structure is completely different: heading1 is no longer used, it is now section1. How can I change these section titles with code? Something along the lines of if(heading1) rename to section1;
I am using Interop.Word to perform these operations.
Below is the code I'm using:
public string UpdateDocumentWithNewTemplate(string document, string theme, string template, Word.Application wordApp)
{
    try
    {
        object missing = System.Reflection.Missing.Value;
        Word.Document aDoc = null;
        object notReadOnly = false;
        object isVisible = false;
        wordApp.Visible = false;
        // create objects from variables for wordApp
        object documentObject = document;
        // open existing document
        aDoc = wordApp.Documents.Open(ref documentObject, ref missing, ref notReadOnly, ref missing, ref missing,
            ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref isVisible,
            ref missing, ref missing, ref missing, ref missing);
        aDoc.Activate();
        // set template and theme to overwrite the existing styles
        aDoc.CopyStylesFromTemplate(template);
        aDoc.ApplyDocumentTheme(theme);
        aDoc.UpdateStyles();
        // save the file with the changes
        aDoc.SaveAs(ref documentObject, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing,
            ref missing, ref missing, ref missing, ref missing, ref missing, ref missing);
        // close the document
        aDoc.Close(ref missing, ref missing, ref missing);
        if (aDoc != null)
            System.Runtime.InteropServices.Marshal.ReleaseComObject(aDoc);
        aDoc = null;
        return documentObject.ToString();
    }
    catch (Exception exception)
    {
        return "Error: " + exception;
    }
}

For the specific example you need to first import the styles from the other template, then do a Find/Replace to replace the styles applied. I see from your code that you've got the first part (aDoc.CopyStylesFromTemplate(template); aDoc.ApplyDocumentTheme(theme); aDoc.UpdateStyles();).
What many don't realize about Word's Find/Replace functionality is that it can also work with formatting. The best way to get the necessary syntax is to record a successful Find/Replace in a macro, then port the VBA to C#. In the UI:
1. Press Ctrl+H to open the Replace dialog box.
2. With the cursor in the "Find what" box, click "More", then "Format", and choose "Style".
3. Select the name of the style you want to find and have replaced.
4. Click in the "Replace with" box.
5. Use Format/Style again to choose the style you want to use.
6. Click "Replace All".
Here's the result I get:
Selection.Find.ClearFormatting
Selection.Find.Style = ActiveDocument.styles("Heading 1")
Selection.Find.Replacement.ClearFormatting
Selection.Find.Replacement.Style = ActiveDocument.styles("section2")
With Selection.Find
    .Text = ""
    .Replacement.Text = ""
    .Forward = True
    .wrap = wdFindContinue
    .Format = True
    .MatchCase = False
    .MatchWholeWord = False
    .MatchByte = False
    .CorrectHangulEndings = False
    .HanjaPhoneticHangul = False
    .MatchWildcards = False
    .MatchSoundsLike = False
    .MatchAllWordForms = False
End With
Selection.Find.Execute Replace:=wdReplaceAll
You should use Range, not Selection, so the C# code would look something like the following code block. Note how:
- I get the Range of the entire document.
- I create a Find object for the Range and use that.
- I reference the Styles for the Find in two ways, to show both possibilities.
You can set almost all the properties for Find before calling Find.Execute. It would also be possible to create object variables for each of these (only one each is necessary for true and false) and then pass them "by ref" in Find.Execute. As far as I know, this is simply a matter of personal preference. I did it this way to give the most literal "translation" of the VBA to C# code.
In any case, Find.Execute "remembers" these settings, so ref missing can then be used for all the parameters you don't set specifically. In this case, only the "replace all" command is passed explicitly to the method.
Word.Document doc = wdApp.ActiveDocument;
Word.Range rngFind = doc.Content;
Word.Find fd = rngFind.Find;
fd.ClearFormatting();
Word.Style stylFind = doc.Styles["Heading 1"];
fd.set_Style(stylFind);
fd.Replacement.ClearFormatting();
fd.Replacement.set_Style(doc.Styles["section2"]);
fd.Text = "";
fd.Replacement.Text = "";
fd.Forward = true;
fd.Wrap = Word.WdFindWrap.wdFindStop;
fd.Format = true;
fd.MatchCase = false;
fd.MatchWholeWord = false;
fd.MatchByte = false;
fd.CorrectHangulEndings = false;
fd.HanjaPhoneticHangul = false;
fd.MatchWildcards = false;
fd.MatchSoundsLike = false;
fd.MatchAllWordForms = false;
object replaceAll = Word.WdReplace.wdReplaceAll;
object missing = Type.Missing;
fd.Execute(ref missing, ref missing, ref missing, ref missing, ref missing,
ref missing, ref missing, ref missing, ref missing, ref missing,
ref replaceAll, ref missing, ref missing, ref missing, ref missing);

Deleting text from a Word Interop document

I am having trouble removing a list of data/text from a Word document using Word Interop. My idea was to read through the document to find the starting text, then find the ending text, and save each of those indexes in its own variable. Next I would loop from the starting index to the ending index and delete all the text in between.
The problem is that it works incorrectly and doesn't give the expected results. I must not be understanding how the Range interface works in document.Paragraphs[i+1].Range.Delete();. It deletes some lines but not all, and seems to go beyond the paragraphs that I want to delete. What am I missing? There must be a better way to do this. Documentation for Interop seems sparse.
string text = " ";
int StartLocation = 0;
int EndLocation = 0;
// I roughly know the starting location,
// so I start at i = 2248 to avoid looping
// through the entire document
for (int i = 2248; i < 2700; i++)
{
    text = document.Paragraphs[i + 1].Range.Text.ToString();
    if (text.Contains("firstWordImSearchingFor"))
    {
        StartLocation = i;
    }
    if (text.Contains("lastWordImSearchingFor"))
    {
        EndLocation = i;
    }
}
// delete everything between those paragraph locations
// (not working correctly / skips lines)
for (int i = StartLocation; i < EndLocation - 1; i++)
{
    document.Paragraphs[i + 1].Range.Delete();
}

The drawback to the approach you're trying is that the Start and End locations (number of characters from the beginning of the Document story) will vary depending on what non-visible / non-printing characters are present. Content Controls, field codes and other things affect this - all in different ways depending on how things are being queried.
More reliable would be to store the starting point in one Range then extend it to the end point.
I also recommend using Range.Find to search for the start and end points.
Bare-bones pseudo-code example, since I don't really have enough information to go on to give you full, working code:
Word.Range rngToDelete = null;
Word.Range rngFind = document.Content;
bool wasFound = false;
object missing = System.Type.Missing;
object oEnd = Word.WdCollapseDirection.wdCollapseEnd;
object findStart = "firstWordImSearchingFor";
object findEnd = "lastWordImSearchingFor";
wasFound = rngFind.Find.Execute(ref findStart, ref missing, ref missing,
    ref missing, ref missing, ref missing, ref missing, ref missing, ref missing,
    ref missing, ref missing, ref missing, ref missing, ref missing, ref missing);
if (wasFound)
{
    rngToDelete = rngFind.Duplicate; // rngFind is now where the term was found!
    // reset the range to Find so it moves forward
    rngFind.Collapse(ref oEnd);
    rngFind.End = document.Content.End;
    wasFound = rngFind.Find.Execute(ref findEnd, ref missing, ref missing,
        ref missing, ref missing, ref missing, ref missing, ref missing, ref missing,
        ref missing, ref missing, ref missing, ref missing, ref missing, ref missing);
    if (wasFound)
    {
        // extend the stored range to the end of the second match, then delete it all at once
        rngToDelete.End = rngFind.End;
        rngToDelete.Delete();
    }
}
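Separately from character positions, a forward loop that deletes items by index can also skip entries, because each deletion shifts the remaining items down by one. The sketch below is only an illustration on a plain List<string> standing in for document.Paragraphs; that this is what happens in the question's loop is an assumption, and the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo: a List<string> stands in for the paragraph collection.
public static class ParagraphDeletionDemo
{
    // Forward loop: after each removal the next item slides into the
    // current index, so the loop steps over it.
    public static List<string> DeleteForward(List<string> paras, int start, int end)
    {
        var copy = new List<string>(paras);
        for (int i = start; i <= end && i < copy.Count; i++)
        {
            copy.RemoveAt(i);
        }
        return copy;
    }

    // Backward loop: removing from the end leaves the indexes of the
    // items still to be removed unchanged.
    public static List<string> DeleteBackward(List<string> paras, int start, int end)
    {
        var copy = new List<string>(paras);
        for (int i = end; i >= start; i--)
        {
            copy.RemoveAt(i);
        }
        return copy;
    }
}
```

With the items p0 to p5 and the index range 1 to 3, the forward version leaves p0, p2, p4 (it keeps p2 and wrongly removes p5), while the backward version leaves p0, p4, p5. Deleting a single Range that spans the whole block, as in the code above, avoids the issue entirely.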

This is completely untested and is offered as a suggestion:
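// Requires: using System.Collections.Generic;
// Collect the ranges first, then delete, so the collection is not changed while it is enumerated.
var toDelete = new List<Word.Range>();
bool inDelete = false;
foreach (Word.Paragraph para in document.Paragraphs)
{
    string paraText = para.Range.Text;
    if (paraText.Contains("Start flag"))
    {
        inDelete = true;
    }
    if (inDelete)
    {
        // to retain the "End flag" paragraph, skip this Add when paraText contains "End flag"
        toDelete.Add(para.Range);
    }
    if (inDelete && paraText.Contains("End flag"))
    {
        break;
    }
}
// Delete from the last collected range backwards so the earlier ranges are not shifted
for (int i = toDelete.Count - 1; i >= 0; i--)
{
    toDelete[i].Delete();
}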

Programmatically insert a mail merge field into a hyperlink in a Word document

I'm trying to figure out how to programmatically insert a mail merge field into a hyperlink in a Word document.
In the MS Word application this is easily accomplished with the following field code when in field-code view (Alt+F9):
{HYPERLINK "http://example.com?id={MERGEFILED ID}"}
I searched Stack Overflow and Google but came up empty-handed.
How could I accomplish something like the above snippet via the C# Word interop library?
Right now this is what I have:
using mso = Microsoft.Office.Interop.Word;
public class Test
{
    public void GenerateDynamicHyperlinkWithMergeField()
    {
        mso.Application app = new mso.Application();
        object missing = System.Reflection.Missing.Value;
        mso.Document doc = app.Documents.Add(ref missing, ref missing, ref missing, ref missing);
        mso.Range range = app.Selection.Range;
        // this is hyperlinked correctly
        mso.Hyperlink hl = doc.Hyperlinks.Add(range, "http://example.com?id=", ref missing, ref missing, "textToDisplay", ref missing);
        // this merge field ends up outside of the hyperlink
        mso.MailMerge merge = app.ActiveDocument.MailMerge;
        mso.MailMergeField mf = merge.Fields.Add(range, "id");
        // inserts the merge field code into the hyperlink, but not as code that Word recognizes
        mso.Hyperlink hl2 = doc.Hyperlinks.Add(range, "http://example.com?id=" + mf.Code.Text, ref missing, ref missing, "textToDisplay", ref missing);
    }
}
Any help would be much appreciated.
UPDATE:
To clarify the result expected in the Word document:
I want this: {HYPERLINK "http://example.com?id={MERGEFILED ID}"}
But I get this with the above function: {HYPERLINK "http://example.com?id="}{MERGEFILED ID}

Try this:
string myLink = "http://example.com?id=" + mf.Code.Text;
mso.Hyperlink hl2 = doc.Hyperlinks.Add(range, myLink, ref missing, ref missing, "textToDisplay", ref missing);

This might be a red herring, but have you noticed it says MERGEFILED and not MERGEFIELD?
I only bring it up because it's in all instances you mention it.
