Right now I'm exporting some text to a Word 2010 document. I have everything working except new lines. What is the character for a new line? I've tried "\r\n", " ^p ", and "\n", but nothing works.
I'm using the "FindAndReplace" method to replace strings with strings.
The purpose for the newlines is some required formatting. My coworkers have a 6 line box that the text belongs in. On line 1 in that box I have "" and I'm replacing it with information from a database. If the information exceeds one line, they don't want the box to become 7 lines. So I've figured out how to calculate how many lines the text requires and I re-sized the box to 1 line. So for example if my string requires 2 lines, I want to put 4 blank lines after that.
If this is not possible, I was thinking of putting in that box:
<line1>
<line2>
<line3> and so on...
Then just replace each line individually. Any other thoughts?
Thanks in advance.
You can find each paragraph mark with ^13 or (the equivalent) ^p and replace it with as many new lines as you require by concatenating ^13. (^l, by contrast, matches a manual line break.) The "Suchen und Ersetzen" dialog below is German for "Search and Replace". Tested in Word 2010.
Example:
This should work as-is through COM automation with C#. An example link if you need one.
Here's proof of concept code:
namespace StackWord
{
using System; // needed for Console
using StackWord = Microsoft.Office.Interop.Word;
internal class Program
{
private static void Main(string[] args)
{
var myWord = new StackWord.Application { Visible = true };
var myDoc = myWord.Documents.Add();
var myParagraph = myDoc.Paragraphs.Add();
myParagraph.Range.Text =
"Example test one\rExample two\rExample three\r";
foreach (StackWord.Range range in myWord.ActiveDocument.StoryRanges)
{
range.Find.Text = "\r";
range.Find.Replacement.Text = "\r\r\r\r";
range.Find.Wrap = StackWord.WdFindWrap.wdFindContinue;
object replaceAll = StackWord.WdReplace.wdReplaceAll;
if (range.Find.Execute(Replace: ref replaceAll))
{
Console.WriteLine("Found and replaced.");
}
}
Console.WriteLine("Press any key to close...");
Console.ReadKey();
myWord.Quit();
}
}
}
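The padding described in the question (fill the box up to a fixed number of lines) can be sketched as a small helper that builds the replacement string before it is handed to Find.Replacement.Text. This is only a sketch: BoxPadding and PadToBox are hypothetical names, and "\r" is assumed as the paragraph separator because that is what the proof of concept above uses.

```csharp
using System;
using System.Linq;

public static class BoxPadding
{
    // Appends enough paragraph marks after the text so that the text plus
    // the blank lines always occupy exactly boxLines lines.
    // linesUsed is the number of lines the text itself needs (computed by the caller).
    public static string PadToBox(string text, int linesUsed, int boxLines, string newline = "\r")
    {
        if (linesUsed < 1) throw new ArgumentOutOfRangeException(nameof(linesUsed));

        // Never produce a negative count when the text is longer than the box.
        int blanks = Math.Max(0, boxLines - linesUsed);
        return text + string.Concat(Enumerable.Repeat(newline, blanks));
    }
}
```

For a 6-line box and a text that needs 2 lines, PadToBox(text, 2, 6) appends 4 paragraph marks, which matches the example in the question.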
You can always try using:
Environment.NewLine
If you save your file in Word 97-2003 format (*.doc), your "FindAndReplace" method will work.
I have a problem: I need to compare two Word documents, both text and formatting, in C#. I found a third-party library, DevExpress, that can view and process the documents, so I downloaded the trial to check whether it can solve the problem.
For example, I have two Word documents:
1: This is a text example
2: This is not a text example
In the text above, the only difference is the word "not".
My problem is: how can I check the differences, including the formatting?
So far, this is my code for iterating over the contents of a document:
public void CompareEpub(string word)
{
try
{
using (DevExpress.XtraRichEdit.RichEditDocumentServer srv = new DevExpress.XtraRichEdit.RichEditDocumentServer())
{
srv.LoadDocument(word);
MyIterator visitor = new MyIterator();
DocumentIterator iterator = new DocumentIterator(srv.Document, true);
while (iterator.MoveNext())
{
iterator.Current.Accept(visitor);
}
foreach (var item in visitor.ListOfText)
{
Debug.WriteLine("text: " + item.Text + " b: " + item.IsBold + " u: " + item.IsUnderline + " i: " + item.IsItalic);
}
}
}
catch (Exception ex)
{
Debug.WriteLine(ex.Message);
Debug.WriteLine(ex.StackTrace);
throw; // rethrow without resetting the stack trace
}
}
public class MyIterator : DocumentVisitorBase
{
public List<Model.HtmlContent> ListOfText { get; }
public MyIterator()
{
ListOfText= new List<Model.HtmlContent>();
}
public override void Visit(DocumentText text)
{
var m = new Model.HtmlContent
{
Text = text.Text,
IsBold = text.TextProperties.FontBold,
IsItalic = text.TextProperties.FontItalic,
IsUnderline = text.TextProperties.UnderlineWordsOnly
};
ListOfText.Add(m);
}
}
With the code above I can walk through the text and its formatting. But how can I use this to compare text?
Suppose I build one list per document.
How can I compare them?
If I compare the text of one list against the other in a loop,
the only result I get is whether two words are equal.
Can anyone help me with this, or just suggest an idea for how to make it work?
I didn't post on the DevExpress forum because I feel this is a question about how to approach the problem, not a problem with the trial or the control I've been using. I also found out that the control has no built-in text comparison like the one in Microsoft Word.
Thank you.
Update:
Desired output
This is (not) a text example
The text inside the () is text that was not found in the first document.
The output I want is like the output of Diff Match Patch:
https://github.com/pocketberserker/Diff.Match.Patch
But I can't work out how to implement the check for formatting.
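One possible direction, sketched here under assumptions rather than taken from DevExpress or Diff Match Patch: treat each word together with its formatting flags as one token, and run a longest-common-subsequence comparison over the two token lists. Token, FormatAwareDiff and MarkAdditions are hypothetical names; the sketch needs C# 9 for records and only reports tokens that are new or changed in the second document.

```csharp
using System;
using System.Collections.Generic;

// A word plus the formatting flags that should take part in the comparison.
// Records compare by value, so a bold "not" differs from a plain "not".
public sealed record Token(string Text, bool Bold, bool Italic, bool Underline);

public static class FormatAwareDiff
{
    // Returns the words of 'second', wrapping in parentheses every token
    // that has no counterpart (same text and same formatting) in 'first'.
    public static string MarkAdditions(IReadOnlyList<Token> first, IReadOnlyList<Token> second)
    {
        int n = first.Count, m = second.Count;

        // lcs[i, j] = length of the longest common subsequence of first[i..] and second[j..].
        var lcs = new int[n + 1, m + 1];
        for (int i = n - 1; i >= 0; i--)
            for (int j = m - 1; j >= 0; j--)
                lcs[i, j] = first[i].Equals(second[j])
                    ? lcs[i + 1, j + 1] + 1
                    : Math.Max(lcs[i + 1, j], lcs[i, j + 1]);

        var parts = new List<string>();
        int a = 0, b = 0;
        while (b < m)
        {
            if (a < n && first[a].Equals(second[b]))
            {
                parts.Add(second[b].Text);           // unchanged token
                a++; b++;
            }
            else if (a < n && lcs[a + 1, b] >= lcs[a, b + 1])
            {
                a++;                                 // token exists only in the first document
            }
            else
            {
                parts.Add("(" + second[b].Text + ")"); // token is new or reformatted
                b++;
            }
        }
        return string.Join(" ", parts);
    }
}
```

The items collected in ListOfText would first have to be split into words and mapped to Token values; that mapping is left out because it depends on the model class.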
I have the following code, which tries to read data from a text file (so users can modify it easily) and automatically format a paragraph based on the words in the text file plus variables in the form. The contents of the "Body" file go into a field. My body text file has the following data in it:
"contents: " + contents
Based on that, I was hoping to get
contents: Item 1, 2, etc.
based on my input. Instead I get exactly what is in the text file, despite the "" around the literal part. What am I doing wrong? I was hoping to get the values of my variables in addition to my text.
string readSettings(string name)
{
string path = System.Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments) + "/Yuneec_Repair_Inv";
try
{
// Create an instance of StreamReader to read from a file.
// The using statement also closes the StreamReader.
using (StreamReader sr = new StreamReader(path + "/" + name + ".txt"))
{
string data = sr.ReadToEnd();
return data;
}
}
catch (Exception e)
{
// Let the user know what went wrong.
Console.WriteLine("The settings file for " + name + " could not be read:");
Console.WriteLine(e.Message);
string content = "error";
return content;
}
}
private void Form1_Load(object sender, EventArgs e)
{
createSettings("Email");
createSettings("Subject");
createSettings("Body");
yuneecEmail = readSettings("Email");
subject = readSettings("Subject");
body = readSettings("Body");
}
private void button2_Click(object sender, EventArgs e)
{
bodyTextBox.Text = body;
}
If you want to let your users customize certain parts of the text, use an "indicator" that you know beforehand and that can be searched for and parsed out. For example, treat everything between # and # as a string that you will read.
Hello #Mr Douglas#,
Today is #DayOfTheWeek#.....
At that point your user can replace whatever they need in between the # and # symbols and you read that (for example using Regular Expressions) and use that as your "variable" text.
Let me know if this is what you are after and I can provide some C# code as an example.
Ok, this is the example code for that:
StreamReader sr = new StreamReader(@"C:\temp\settings.txt");
var set = sr.ReadToEnd();
var settings = new Regex(@"(?<=\[)(.*?)(?=\])").Matches(set);
foreach (var setting in settings)
{
Console.WriteLine("Parameter read from settings file is " + setting);
}
Console.WriteLine("Press any key to finish program...");
Console.ReadKey();
And this is the source of the text file:
Hello [MrReceiver],
This is [User] from [Company] something else, not very versatile using this as an example :)
[Signature]
Hope this helps!
When you read text from a file as a string, you get a string of text, nothing more.
There's no part of the system which assumes it's C#, parses, compiles and executes it in the current scope, casts the result to text and gives you the result of that.
That would be mostly not what people want, and would be a big security risk - the last thing you want is to execute arbitrary code from outside your program with no checks.
If you need a templating engine, you need to build one - e.g. read in the string, process the string looking for keywords, e.g. %content%, then add the data in where they are - or find a template processing library and integrate it.
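As a rough illustration of the "build one" option, here is a minimal sketch of keyword substitution with %name% placeholders. TinyTemplate and Render are hypothetical names, and the %...% syntax is just the convention suggested above; unknown placeholders are left untouched so that typos in the text file stay visible.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TinyTemplate
{
    // Matches %name% where name consists of letters, digits or underscores.
    private static readonly Regex Placeholder = new Regex(@"%(\w+)%");

    // Replaces every known %name% with its value; unknown names are kept as-is.
    public static string Render(string template, IReadOnlyDictionary<string, string> values)
    {
        return Placeholder.Replace(template, m =>
            values.TryGetValue(m.Groups[1].Value, out var value) ? value : m.Value);
    }
}
```

With this, the body file would contain "contents: %contents%" instead of a C# expression, and the form would call TinyTemplate.Render(body, values) with a dictionary that maps "contents" to the current input.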
I have been working successfully with the C# OpenXml SDK (Unofficial Microsoft Package 2.5 from NuGet) for some time now, but have recently noticed that the following line of code returns different results depending on what mood Microsoft Word appears to be in when the file gets saved:
var fields = document.Descendants<FieldCode>();
From what I can tell, when creating the document in the first place (using Word 2013 on Windows 8.1) if you use the Insert->QuickParts->Field and choose MergeField from the Field names left hand pane, and then provide a Field name in the field properties and click OK then the field code is correctly saved in the document as I would expect.
Then when using the aforementioned line of code I will receive a field code count of 1 field. If I subsequently edit this document (and even leave this field well alone) the subsequent saving could mean that this field code no longer is returned in my query.
Another case of the same curiousness is when I see the FieldCode nodes split across multiple items. So rather than seeing say:
" MERGEFIELD Author \\* MERGEFORMAT "
As the node name, I will see:
" MERGEFIELD Aut"
"hor \\* MERGEFORMAT"
Split as two FieldCode node values. I have no idea why this would be the case, but it certainly makes my ability to match nodes that much more exciting. Is this expected behaviour? A known bug? I don't really want to have to crack open the raw xml and edit this document to work until I understand what is going on. Many thanks all.
I came across this very problem myself, and found a solution that exists within OpenXML: a utility class called MarkupSimplifier which is part of the PowerTools for Open XML project. Using this class solved all the problems I was having that you describe.
The full article is located here.
Here are some pertinent excerpts:
Perhaps the most useful simplification that this performs is to merge adjacent runs with identical formatting.
It goes on to say:
Open XML applications, including Word, can arbitrarily split runs as necessary. If you, for instance, add a comment to a document, runs will be split at the location of the start and end of the comment. After MarkupSimplifier removes comments, it can merge runs, resulting in simpler markup.
An example of the utility class in use is:
SimplifyMarkupSettings settings = new SimplifyMarkupSettings
{
RemoveComments = true,
RemoveContentControls = true,
RemoveEndAndFootNotes = true,
RemoveFieldCodes = false,
RemoveLastRenderedPageBreak = true,
RemovePermissions = true,
RemoveProof = true,
RemoveRsidInfo = true,
RemoveSmartTags = true,
RemoveSoftHyphens = true,
ReplaceTabsWithSpaces = true,
};
MarkupSimplifier.SimplifyMarkup(wordDoc, settings);
I have used this many times with Word 2010 documents using VS2015 .Net Framework 4.5.2 and it has made my life much, much easier.
Update:
I have revisited this code and found that it cleans up runs in MERGEFIELDs but not in IF fields that reference merge fields, e.g.
{if {MERGEFIELD When39} = "Y???" "Y" "N" }
I have no idea why this might be so, and examination of the underlying XML offers no hints.
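Even when a field code is split across several runs, the pieces still appear in document order between the field's begin marker and its separate (or end) marker. A hedged sketch of reading the instruction text without depending on how the runs were split, using only System.Xml.Linq on the part's XML: FieldCodeReader and ReadFieldInstructions are hypothetical names, and nested fields (such as the IF example above) and w:fldSimple fields are deliberately not handled.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Xml.Linq;

public static class FieldCodeReader
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Concatenates all w:instrText pieces found between a "begin" fldChar and
    // the following "separate" or "end" fldChar, one string per field.
    public static List<string> ReadFieldInstructions(XElement root)
    {
        var result = new List<string>();
        StringBuilder current = null;

        foreach (XElement el in root.Descendants())
        {
            if (el.Name == W + "fldChar")
            {
                string type = (string)el.Attribute(W + "fldCharType");
                if (type == "begin")
                {
                    current = new StringBuilder();      // a new field starts
                }
                else if (current != null && (type == "separate" || type == "end"))
                {
                    result.Add(current.ToString());     // instruction text is complete
                    current = null;
                }
            }
            else if (el.Name == W + "instrText" && current != null)
            {
                current.Append(el.Value);               // one fragment of the field code
            }
        }
        return result;
    }
}
```

The root element could come from XElement.Load on the stream of the main document part; matching is then done on the joined string rather than on individual FieldCode nodes.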
Word will often split a text run into multiple text runs for no reason I've ever understood. Before searching, comparing, tidying, etc., we preprocess the body with a method that combines adjacent runs with identical formatting into a single text run.
/// <summary>
/// Combines the identical runs.
/// </summary>
/// <param name="body">The body.</param>
public static void CombineIdenticalRuns(W.Body body)
{
List<W.Run> runsToRemove = new List<W.Run>();
foreach (W.Paragraph para in body.Descendants<W.Paragraph>())
{
List<W.Run> runs = para.Elements<W.Run>().ToList();
for (int i = runs.Count - 2; i >= 0; i--)
{
W.Text text1 = runs[i].GetFirstChild<W.Text>();
W.Text text2 = runs[i + 1].GetFirstChild<W.Text>();
if (text1 != null && text2 != null)
{
string rPr1 = "";
string rPr2 = "";
if (runs[i].RunProperties != null) rPr1 = runs[i].RunProperties.OuterXml;
if (runs[i + 1].RunProperties != null) rPr2 = runs[i + 1].RunProperties.OuterXml;
if (rPr1 == rPr2)
{
text1.Text += text2.Text;
runsToRemove.Add(runs[i + 1]);
}
}
}
}
foreach (W.Run run in runsToRemove)
{
run.Remove();
}
}
I tried to simplify the document with PowerTools, but the result was a corrupted Word file. I wrote this routine to simplify only field codes that have specific names. It works in all parts of the document (main document part, headers and footers):
internal static void SimplifyFieldCodes(WordprocessingDocument document)
{
var masks = new string[] { Constants.VAR_MASK, Constants.INP_MASK, Constants.TBL_MASK, Constants.IMG_MASK, Constants.GRF_MASK };
SimplifyFieldCodesInElement(document.MainDocumentPart.RootElement, masks);
foreach (var headerPart in document.MainDocumentPart.HeaderParts)
{
SimplifyFieldCodesInElement(headerPart.Header, masks);
}
foreach (var footerPart in document.MainDocumentPart.FooterParts)
{
SimplifyFieldCodesInElement(footerPart.Footer, masks);
}
}
internal static void SimplifyFieldCodesInElement(OpenXmlElement element, string[] regexpMasks)
{
foreach (var run in element.Descendants<Run>()
.Select(item => (Run)item)
.ToList())
{
var fieldChar = run.Descendants<FieldChar>().FirstOrDefault();
if (fieldChar != null && fieldChar.FieldCharType == FieldCharValues.Begin)
{
string fieldContent = "";
List<Run> runsInFieldCode = new List<Run>();
var currentRun = run.NextSibling();
while ((currentRun is Run) && currentRun.Descendants<FieldCode>().FirstOrDefault() != null)
{
var currentRunFieldCode = currentRun.Descendants<FieldCode>().FirstOrDefault();
fieldContent += currentRunFieldCode.InnerText;
runsInFieldCode.Add((Run)currentRun);
currentRun = currentRun.NextSibling();
}
// If there is more than one Run for the FieldCode, and is one we must change, set the complete text in the first Run and remove the rest
if (runsInFieldCode.Count > 1)
{
// Check the field code to make sure it is one we must simplify (so that TOC, PAGEREF, etc. are left unchanged)
bool applyTransform = false;
foreach (string regexpMask in regexpMasks)
{
Regex regex = new Regex(regexpMask);
Match match = regex.Match(fieldContent);
if (match.Success)
{
applyTransform = true;
break;
}
}
if (applyTransform)
{
var currentRunFieldCode = runsInFieldCode[0].Descendants<FieldCode>().FirstOrDefault();
currentRunFieldCode.Text = fieldContent;
runsInFieldCode.RemoveAt(0);
foreach (Run runToRemove in runsInFieldCode)
{
runToRemove.Remove();
}
}
}
}
}
}
Hope this helps!!!
I have a requirement where I would like users to type some string tokens into a Word document so that they can be replaced via a C# application with some values. So say I have a document as per the image
Now using the SDK I can read the document as follows:
private void InternalParseTags(WordprocessingDocument aDocumentToManipulate)
{
StringBuilder sbDocumentText = new StringBuilder();
using (StreamReader sr = new StreamReader(aDocumentToManipulate.MainDocumentPart.GetStream()))
{
sbDocumentText.Append(sr.ReadToEnd());
}
however as this comes back as the raw XML I cannot search for the tags easily as the underlying XML looks like:
<w:t>&lt;:</w:t></w:r><w:r w:rsidR="002E53FF" w:rsidRPr="000A794A"><w:t>Person.Meta.Age
(and obviously is not something I would have control over) instead of what I was hoping for namely:
<w:t>&lt;: Person.Meta.Age
OR
<w:t><: Person.Meta.Age
So my question is how do I actually work on the string itself namely
<: Person.Meta.Age :>
and still preserve formatting etc. so that when I have replaced the tokens with values I have:
Note: Bolding of the value of the second token value
Do I need to iterate document elements or use some other approach? All pointers greatly appreciated.
This is a bit of a thorny problem with OpenXML. The best solution I've come across is explained here:
http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2011/06/13/open-xml-presentation-generation-using-a-template-presentation.aspx
Basically Eric expands the content such that each character is in a run by itself, then looks for the run that starts a '<:' sequence and then the end sequence. Then he does the substitution and recombines all runs that have the same attributes.
The example is for PowerPoint, which is generally much less content-intensive, so performance might be a factor in Word; I expect there are ways to narrow down the scope of paragraphs or whatever you have to blow up.
For example, you can extract the text of the paragraph to see if it includes any placeholders and only do the expand/replace/condense operation on those paragraphs.
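The same idea can be sketched without expanding to one run per character, by working on the list of run texts of a single paragraph: find the token in the concatenated text, then map its character range back onto the runs. This is an illustrative sketch only; CrossRunReplace and ReplaceFirst are hypothetical names, and the caller is assumed to copy the resulting strings back into the corresponding Text elements, so each run keeps its own formatting.

```csharp
using System;
using System.Collections.Generic;

public static class CrossRunReplace
{
    // Replaces the first occurrence of 'token' in the concatenation of runTexts.
    // The replacement is written into the run where the token starts; every other
    // run loses only the characters that belonged to the token.
    // Returns false when the token does not occur.
    public static bool ReplaceFirst(IList<string> runTexts, string token, string replacement)
    {
        if (string.IsNullOrEmpty(token)) throw new ArgumentException("Token must not be empty.", nameof(token));

        string all = string.Concat(runTexts);
        int start = all.IndexOf(token, StringComparison.Ordinal);
        if (start < 0) return false;
        int end = start + token.Length;

        int offset = 0;
        for (int i = 0; i < runTexts.Count; i++)
        {
            string text = runTexts[i];
            int runStart = offset;
            int runEnd = offset + text.Length;
            offset = runEnd;

            // Skip runs that do not overlap the token at all.
            if (runEnd <= start || runStart >= end) continue;

            int from = Math.Max(start, runStart) - runStart; // first token character inside this run
            int to = Math.Min(end, runEnd) - runStart;       // one past the last token character
            string insert = runStart <= start ? replacement : string.Empty;
            runTexts[i] = text.Substring(0, from) + insert + text.Substring(to);
        }
        return true;
    }
}
```

Calling the method in a loop until it returns false replaces every occurrence. The replacement inherits the formatting of the run in which the token begins, which may or may not be the formatting the template author intended.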
Instead of doing find/replace of tokens directly with OpenXML, you could use a third-party OpenXML-based templating toolkit, which is trivial to use and can pay for itself quickly.
As Scanny pointed out, OpenXML is full of nasty details that you have to master one by one. The learning curve is long and steep. If you want to become an OpenXML guru, go for it and start climbing. If you want to have time for a decent social life, there are alternatives: pick a third-party toolkit that is based on OpenXML. I've evaluated Docentric Toolkit. It offers a template-based approach: you prepare a template, which is a file in Word format containing placeholders for data that gets merged in by the application at runtime. It supports any formatting that MS Word supports, and you can use conditional content, tables, etc.
You can also create or change a document using a DOM approach. The final document can be .docx or .pdf.
Docentric is a licensed product, but the time you save by using such a tool will soon compensate for the cost.
If you will be running your application on a server, don't use interop - see this link for more details: (http://support2.microsoft.com/kb/257757).
Here is some code I slapped together pretty quickly to account for tokens spread across runs in the xml. I don't know the library much, but was able to get this to work. This could use some performance enhancements too because of all the looping.
/// <summary>
/// Iterates through texts, concatenates them and looks for tokens to replace
/// </summary>
/// <param name="texts"></param>
/// <param name="tokenNameValuePairs"></param>
/// <returns>T/F whether a token was replaced. Should loop this call until it returns false.</returns>
private bool IterateTextsAndTokenReplace(IEnumerable<Text> texts, IDictionary<string, object> tokenNameValuePairs)
{
List<Text> tokenRuns = new List<Text>();
string runAggregate = String.Empty;
bool replacedAToken = false;
foreach (var run in texts)
{
if (run.Text.Contains(prefixTokenString) || runAggregate.Contains(prefixTokenString))
{
runAggregate += run.Text;
tokenRuns.Add(run);
if (run.Text.Contains(suffixTokenString))
{
if (possibleTokenRegex.IsMatch(runAggregate))
{
string possibleToken = possibleTokenRegex.Match(runAggregate).Value;
string innerToken = possibleToken.Replace(prefixTokenString, String.Empty).Replace(suffixTokenString, String.Empty);
if (tokenNameValuePairs.ContainsKey(innerToken))
{
//found token!!!
string replacementText = runAggregate.Replace(prefixTokenString + innerToken + suffixTokenString, Convert.ToString(tokenNameValuePairs[innerToken]));
Text newRun = new Text(replacementText);
run.InsertAfterSelf(newRun);
foreach (Text runToDelete in tokenRuns)
{
runToDelete.Remove();
}
replacedAToken = true;
}
}
runAggregate = String.Empty;
tokenRuns.Clear();
}
}
}
return replacedAToken;
}
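// Constants and a static field, because an instance field initializer cannot reference other instance fields.
private const string prefixTokenString = "{";
private const string suffixTokenString = "}";
// Escape the delimiters so that they are always matched literally.
private static readonly Regex possibleTokenRegex = new Regex(Regex.Escape(prefixTokenString) + "[a-zA-Z0-9_-]+" + Regex.Escape(suffixTokenString));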
And some samples of calling the function:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(memoryStream, true))
{
bool replacedAToken = true;
//keep looping over the document until no more tokens are replaced. This is because some tokens are spread across 'runs' and may need a second iteration of processing to catch them.
while (replacedAToken)
{
//get all the text elements
IEnumerable<Text> texts = wordDoc.MainDocumentPart.Document.Body.Descendants<Text>();
replacedAToken = this.IterateTextsAndTokenReplace(texts, tokenNameValuePairs);
}
wordDoc.MainDocumentPart.Document.Save();
foreach (FooterPart footerPart in wordDoc.MainDocumentPart.FooterParts)
{
if (footerPart != null)
{
Footer footer = footerPart.Footer;
if (footer != null)
{
replacedAToken = true;
while (replacedAToken)
{
IEnumerable<Text> footerTexts = footer.Descendants<Text>();
replacedAToken = this.IterateTextsAndTokenReplace(footerTexts, tokenNameValuePairs);
}
footer.Save();
}
}
}
foreach (HeaderPart headerPart in wordDoc.MainDocumentPart.HeaderParts)
{
if (headerPart != null)
{
Header header = headerPart.Header;
if (header != null)
{
replacedAToken = true;
while (replacedAToken)
{
IEnumerable<Text> headerTexts = header.Descendants<Text>();
replacedAToken = this.IterateTextsAndTokenReplace(headerTexts, tokenNameValuePairs);
}
header.Save();
}
}
}
}
I am developing a .NET program with VSTO 2010 on .NET 4.0 that finds a specific subheading (say "Requirements") in a set of Word documents and copies all the content under that subheading using Word interop. I first succeeded with a for loop that matched words: I searched for this word and then for the first word of the next section (say "Functionality").
The documents also have a contents page, so I found that simple word matching wouldn't do: it returned the first occurrence, which was always in the contents section. I then tried finding the second occurrence and was successful, but realized that the word might also appear well before the subheading. So I resorted to matching whole sentences instead, and succeeded in finding both headings (I had to change the search string to "Requirements\r" because that is how it is read).
The problem I am facing now is this: after I get the starting and ending sentences, I select the entire document and use MoveStart and MoveEnd to narrow the selection before copying it and pasting it into another Word document (I don't know how to use Range or Bookmark).
Although I was successful in moving the start, and although the end position I computed is correct, MoveEnd always moves to some text that is at least 10 sentences beyond the actual end. I've been at this for 2 weeks now, and any help in this matter would be greatly appreciated. I don't mean any disrespect to all the programmers out there in the world.
I've shown the code I'm using.
The variables used are self explanatory.
//SourceApp and SourceDoc - Word application that reads source of release notes
//DestinationApp and DestinationDoc = Word application that writes into new document
private void btnGenerate_Click(object sender, EventArgs e)
{
int startpos = findpos(SourceDoc, 1, starttext, sentencecount);
int endpos = findpos(SourceDoc, startpos, endtext, sentencecount);
object realstart = startpos - 1; // To retain the subheading
object realend = -(sentencecount - (endpos - 1)); // to subtract the next subheading
SourceDoc.Activate();
SourceDoc.ActiveWindow.Selection.WholeStory();
SourceDoc.ActiveWindow.Selection.MoveStart(WdUnits.wdSentence, realstart);
SourceDoc.ActiveWindow.Selection.MoveEnd(WdUnits.wdSentence, realend); // the problematic bit
SourceDoc.ActiveWindow.Selection.Copy();
IDataObject data = Clipboard.GetDataObject();
string allText = data.GetData(DataFormats.Text).ToString();
DestinationDoc.Activate();
DestinationDoc.ActiveWindow.Selection.WholeStory();
DestinationDoc.ActiveWindow.Selection.Delete();
DestinationDoc.ActiveWindow.Selection.Paste();
DestinationDoc.Save();
((_Application)SourceApp).Quit();
((_Application)DestinationApp).Quit();
textBox1.AppendText(allText);
}
int findpos(Document docx, int startpos, string txt, int sentencecount)
{
int pos = 0;
string text;
for (int i = startpos; i <= sentencecount; i++)
{
text = docx.Sentences[i].Text;
if (string.Equals(text, txt))
{
pos = i;
break;
}
}
return pos;
}
I would also be extremely grateful if there was a way to extract specific subheading only (like 3.1 , 5.2.3 etc.) which is what I'm trying to achieve. The question is just my way of doing things and I'm open to a better way as well.
Many thanks in advance.
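One way to avoid sentence counting altogether is to walk the paragraphs and rely on whether each one is a heading, which also skips the contents page because its entries are not headings. The sketch below keeps the Word-specific part out: it assumes the caller has already produced (Text, IsHeading) pairs, for example by checking each paragraph's OutlineLevel against WdOutlineLevel.wdOutlineLevelBodyText. SectionExtractor and ExtractSection are hypothetical names, and any nested subheading ends the section in this simple version.

```csharp
using System;
using System.Collections.Generic;

public static class SectionExtractor
{
    // Returns the heading paragraph that matches headingText plus every following
    // paragraph, stopping at the next heading of any level.
    public static List<string> ExtractSection(
        IEnumerable<(string Text, bool IsHeading)> paragraphs, string headingText)
    {
        var section = new List<string>();
        bool inside = false;

        foreach (var (text, isHeading) in paragraphs)
        {
            // Word paragraph text ends with "\r"; strip it before comparing.
            string trimmed = text.TrimEnd('\r', '\n');

            if (isHeading)
            {
                if (inside) break; // the next heading closes the section
                inside = string.Equals(trimmed, headingText, StringComparison.OrdinalIgnoreCase);
            }

            if (inside) section.Add(trimmed);
        }
        return section;
    }
}
```

To select by number (3.1, 5.2.3 and so on), the comparison on headingText could be swapped for a check on the heading's list number, provided the headings are numbered through Word's list formatting.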