Using C#, I need to pull data from a Word document. I have NetOffice for Word installed in the project. The data is in two parts.
First, I need to pull data from the document settings.
Second, I need to pull the content of controls in the document. The content of the fields includes checkboxes, a date, and a few paragraphs. The input method is via controls, so there must be some way to interact with the controls via the api, but I don't know how to do that.
Right now, I've got the following code to pull the flat text from the document:
private static string wordDocument2String(string file)
{
NetOffice.WordApi.Application wordApplication = new NetOffice.WordApi.Application();
NetOffice.WordApi.Document newDocument = wordApplication.Documents.Open(file);
string txt = newDocument.Content.Text;
wordApplication.Quit();
wordApplication.Dispose();
return txt;
}
So the question is: how do I pull the data from the controls in the document, and how do I pull the document settings (such as the title, author, etc. as seen from Word), using either NetOffice or some other package?
I did not bother to implement NetOffice, but the commands should mostly be the same (except probably for implementation and disposal methods).
Microsoft.Office.Interop.Word.Application word = new Microsoft.Office.Interop.Word.Application();
string file = "C:\\Hello World.docx";
Microsoft.Office.Interop.Word.Document doc = word.Documents.Open(file);
// look for a specific type of Field (there are about 200 to choose from).
foreach (Field f in doc.Fields)
{
if (f.Type == WdFieldType.wdFieldDate)
{
//do something
}
}
// example of the myriad properties that could be associated with "document settings"
WdProtectionType protType = doc.ProtectionType;
if (protType.Equals(WdProtectionType.wdAllowOnlyComments))
{
//do something else
}
The MSDN reference on Word Interop is where you will find information on just about anything you need access to in a Word document.
UPDATE:
After reading your comment, here are a few document settings you can access:
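// C# needs indexer syntax here; the cast to dynamic covers projects where the property is typed as object.
string author = ((dynamic)doc.BuiltInDocumentProperties)["Author"].Value;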
string name = doc.Name; // this gives you the file name.
// not clear what you mean by "title"
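If the file is a .docx and you would rather not start Word at all, the title and author can also be read straight from the package. This is a minimal sketch using only the base class library, not Interop or NetOffice. It assumes the core properties live in the conventional docProps/core.xml part, and `DocxCoreProperties` is a hypothetical helper name.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

public static class DocxCoreProperties
{
    // Dublin Core namespace used for title and creator in docProps/core.xml.
    static readonly XNamespace Dc = "http://purl.org/dc/elements/1.1/";

    // Returns (title, author). Missing values come back as empty strings.
    public static Tuple<string, string> Read(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the part sits at its conventional location.
            ZipArchiveEntry entry = zip.GetEntry("docProps/core.xml");
            if (entry == null) return Tuple.Create("", "");
            using (Stream s = entry.Open())
            {
                XElement root = XDocument.Load(s).Root;
                string title = (string)root.Element(Dc + "title") ?? "";
                string author = (string)root.Element(Dc + "creator") ?? "";
                return Tuple.Create(title, author);
            }
        }
    }
}
```

Call it with `File.OpenRead(file)`. It will not work for the older binary .doc format.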
As for working out what text you are getting from a "legacy control", I need more information about exactly what kind of control you are extracting from. Try to find the name of the control/textbox/form/etc. from within the document itself, and then search for that type in the Word Interop reference.
As a stab in the dark, here is an (incomplete) example of getting text from textboxes in the document:
List<string> textBoxText = new List<string>();
foreach (Microsoft.Office.Interop.Word.Shape s in doc.Shapes)
{
// Skip shapes that don't contain text; HasText is non-zero when the text frame holds text.
if (s.TextFrame.HasText != 0)
{
textBoxText.Add(s.TextFrame.TextRange.Text);
}
}
Another possibility is Content Controls, of which there are several types. They are often used to gather user input.
Here is some code to catch a rich text Content Control:
List<string> contentControlText = new List<string>();
foreach(ContentControl CC in doc.ContentControls)
{
if (CC.Type == WdContentControlType.wdContentControlRichText)
{
contentControlText.Add(CC.Range.Text);
}
}
I'm using C# and Xamarin.Forms to create a phone app that, when a button is pressed, pulls specific HTML data from a website and saves it to a text file that the program can read again later. I started with the tutorial in this video: https://www.youtube.com/watch?v=zvp7wvbyceo. Here's the code I have so far, written with the help of this video, https://www.youtube.com/watch?v=wwPx8QJn9Kk, in the "AboutViewModel.cs" file created in the video:
Here is a paste of the code itself (the original post also linked a screenshot of it):
private Task WebScraper()
{
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("https://www.flightview.com/airport/DAB-Daytona_Beach-FL/");
foreach (var item in doc.DocumentNode.SelectNodes("//td[@class='c1']"))
{
var itemstring = item;
File.WriteAllText("AirportData.txt", itemstring);
}
return Task.CompletedTask;
}
public ICommand OpenWebCommand { get; }
public ICommand WebScraperCommand { get; }
}
}
The only error I'm getting right now is "Cannot convert 'HtmlAgilityPack.HtmlNode' to 'string'", which I'm working on fixing, but I don't think this is the best solution, so anything you have is useful. Thanks :)
HtmlNode is an object, not a simple string. You probably want to use the OuterHtml property (or InnerText if you only want the text without the markup), but consult the docs to see which is the right fit for your use case.
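string output = string.Empty;
// SelectNodes returns null when nothing matches, so guard before looping.
var cells = doc.DocumentNode.SelectNodes("//td[@class='c1']");
if (cells != null)
{
foreach (var item in cells)
{
output += item.OuterHtml;
}
}
File.WriteAllText("AirportData.txt", output);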
Note that you need to specify a path to a writable folder; the root folder of the app is not writable. See https://learn.microsoft.com/en-us/xamarin/xamarin-forms/data-cloud/data/files?tabs=windows
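As a rough sketch of what a writable path could look like, assuming the per-user local application data folder is an acceptable place for the file. `AirportDataStore` and its method names are made up for illustration.

```csharp
using System;
using System.IO;

public static class AirportDataStore
{
    // Builds a path inside the local application data folder, which the app can write to.
    public static string GetDataPath(string fileName)
    {
        string folder = Environment.GetFolderPath(
            Environment.SpecialFolder.LocalApplicationData,
            Environment.SpecialFolderOption.Create); // create the folder if it is missing
        return Path.Combine(folder, fileName);
    }

    // Overwrites the file with the scraped text.
    public static void Save(string fileName, string contents)
    {
        File.WriteAllText(GetDataPath(fileName), contents);
    }

    // Reads the file back, or returns an empty string if nothing was saved yet.
    public static string Load(string fileName)
    {
        string path = GetDataPath(fileName);
        return File.Exists(path) ? File.ReadAllText(path) : string.Empty;
    }
}
```

In the scraper you would then call `AirportDataStore.Save("AirportData.txt", output)` instead of writing to a bare file name.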
I've parsed html into a PDF and created a table of contents from the Header tags. The bookmarks in the document work fine, but clicking on the line in the table of contents doesn't do anything. The cursor doesn't change icons like it does if I put a URL in the link.
I used iText RUPS to inspect the final PDF, and the named destinations are present in the final file.
I tried hard coding a couple of the names in just to see what happens, but they also didn't work. Putting in .CreateURL and google.com works fine.
The one thing I'm doing that may or may not be an issue is I'm creating the body document, then creating the table of contents and merging the two documents.
Maybe Bruno can make a cameo on this one.
private static List ProcessOutlineChildren(PdfDocument pdfDocument, List tableOfContents, IEnumerable<PdfOutline> pdfOutlines, IDictionary<String, PdfObject> names = null)
{
List<TabStop> tabStops = new List<TabStop>();
tabStops.Add(new TabStop(580, TabAlignment.RIGHT));
foreach (var o in pdfOutlines)
{
ListItem currentOutlineItem = new ListItem();
Paragraph paragraph = new Paragraph();
paragraph.AddTabStops(tabStops);
paragraph.Add(o.GetTitle());
paragraph.Add(new Tab());
paragraph.Add((pdfDocument.GetPageNumber((PdfDictionary) o.GetDestination().GetDestinationPage(names))).ToString());
paragraph.SetAction(PdfAction.CreateGoTo(o.GetDestination()));
currentOutlineItem.Add(paragraph);
if (o.GetAllChildren().Any())
{
currentOutlineItem.Add(ProcessOutlineChildren(pdfDocument, new List(), o.GetAllChildren(), names));
}
tableOfContents.Add(currentOutlineItem);
}
return tableOfContents;
}
public class CustomOutlineHandler : OutlineHandler
{
//PDF's require a unique name for destinations, this is how the actions/bookmarks jump to a location.
protected override string GenerateUniqueDestinationName(IElementNode element)
{
string destinationName = base.GenerateUniqueDestinationName(element);
if ("p".Equals(element.Name()))
{
destinationName = destinationName.Replace(GetDestinationNamePrefix(), "paragraph-prefix-");
}
return destinationName;
}
}
//From my main method converting things into PDF.
OutlineHandler customOutlineHandler = new CustomOutlineHandler().PutAllTagPriorityMappings(priorityMappings);
customOutlineHandler.SetDestinationNamePrefix("destination-name-");
properties.SetOutlineHandler(customOutlineHandler);
I am trying to let the user pass multiple documents to the web methods of a web service. I can pass one document, but I don't know the best way to pass more than one.
The user can input one document with its details easily.
I have created a list of the same object so that the user can pass an unlimited number of documents.
I could create a separate object for each document, but I prefer to do it dynamically instead of restricting it to a particular number of documents.
The document details are shown in the gridview, but when I pass the list variable to the Document array of the web method, I get an error saying it can't implicitly convert a list type to Document.
//Object of the document in the web service
Document doc = new Document();
doc.DocCode = docCode.Text;
doc.DocName = docname.Text;
doc.DocLocation= docloc.Text;
//the above doc object will be passed to array of document in web service
service.Documents = new Document[]
{
doc
};
//Another way I tried, because I want the user to pass the details of multiple documents at the same time
List<Document> docs = new List<Document>();
docs.Add(new Document() {DocCode=docCode.Text, DocName = docname.Text, DocLocation = docloc.Text});
//To enable the user to check the details entered before passing to the web method
gridview1.DataSource = docs;
gridview1.DataBind();
foreach (DataGridItem row in gridview1.Rows)
{
docs.ToArray();
}
//Shows an error that it can't implicitly convert from the list type to Document
service.Documents = new Document[]
{
docs
};
The document details will be entered by the user in the text boxes and shown in the gridview. Then all the rows of the gridview will be passed to the Document array.
As I understand it, you want to assign everything in docs to service.Documents. You get the error because an array initializer expects individual Document elements, like this:
service.Documents = new Document[]
{
doc, doc
};
But you cannot put a whole list in there as if it were a single element.
Try something like this:
service.Documents = docs.ToArray();
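To show the whole flow in one place, here is a small self-contained sketch. The `Document` class below is only a stand-in for the type generated from the web service, and `DocumentBatch` is a hypothetical helper that collects entries and converts them once at submit time.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the web service's Document type.
public class Document
{
    public string DocCode { get; set; }
    public string DocName { get; set; }
    public string DocLocation { get; set; }
}

public class DocumentBatch
{
    private readonly List<Document> docs = new List<Document>();

    // Call once per "add" click; the list grows to any size.
    public void Add(string code, string name, string location)
    {
        docs.Add(new Document { DocCode = code, DocName = name, DocLocation = location });
    }

    // Call once when submitting; the result can be assigned to a Document[] property.
    public Document[] ToServiceArray()
    {
        return docs.ToArray();
    }
}
```

The list can also be bound to the gridview as before; only the final assignment needs the array.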
I have a requirement where I would like users to type some string tokens into a Word document so that they can be replaced via a C# application with some values. So say I have a document as per the image
Now using the SDK I can read the document as follows:
private void InternalParseTags(WordprocessingDocument aDocumentToManipulate)
{
StringBuilder sbDocumentText = new StringBuilder();
using (StreamReader sr = new StreamReader(aDocumentToManipulate.MainDocumentPart.GetStream()))
{
sbDocumentText.Append(sr.ReadToEnd());
}
However, since this comes back as the raw XML, I cannot search for the tags easily, because the underlying XML looks like this:
<w:t>&lt;:</w:t></w:r><w:r w:rsidR="002E53FF" w:rsidRPr="000A794A"><w:t>Person.Meta.Age
(and obviously is not something I would have control over) instead of what I was hoping for namely:
<w:t>&lt;: Person.Meta.Age
OR
<w:t><: Person.Meta.Age
So my question is how do I actually work on the string itself namely
<: Person.Meta.Age :>
and still preserve formatting etc. so that when I have replaced the tokens with values I have:
Note: Bolding of the value of the second token value
Do I need to iterate document elements or use some other approach? All pointers greatly appreciated.
This is a bit of a thorny problem with OpenXML. The best solution I've come across is explained here:
http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2011/06/13/open-xml-presentation-generation-using-a-template-presentation.aspx
Basically Eric expands the content such that each character is in a run by itself, then looks for the run that starts a '<:' sequence and then the end sequence. Then he does the substitution and recombines all runs that have the same attributes.
The example is for PowerPoint, which is generally much less content-intensive, so performance might be a factor in Word; I expect there are ways to narrow down the scope of paragraphs or whatever you have to blow up.
For example, you can extract the text of the paragraph to see if it includes any placeholders and only do the expand/replace/condense operation on those paragraphs.
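To make the expand/replace/condense idea concrete, here is a toy sketch that uses plain objects instead of the Open XML SDK. `SimpleRun` is a hypothetical stand-in for a run, with a single bold flag in place of real run properties. The rule that the replacement takes the formatting of the token's first character is an assumption made for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a Word run: some text plus one formatting flag.
public sealed class SimpleRun
{
    public string Text;
    public bool Bold;
    public SimpleRun(string text, bool bold) { Text = text; Bold = bold; }
}

public static class RunTokenReplacer
{
    public static List<SimpleRun> Replace(List<SimpleRun> runs, string token, string value)
    {
        if (string.IsNullOrEmpty(token)) return runs;

        // Cheap pre-check: leave paragraphs without the token untouched.
        string paragraphText = string.Concat(runs.Select(r => r.Text));
        if (!paragraphText.Contains(token)) return runs;

        // Expand: one run per character, each keeping its source formatting.
        List<SimpleRun> chars = runs
            .SelectMany(r => r.Text.Select(c => new SimpleRun(c.ToString(), r.Bold)))
            .ToList();

        // Find every non-overlapping occurrence in the flattened text.
        var starts = new List<int>();
        for (int i = paragraphText.IndexOf(token, StringComparison.Ordinal); i >= 0;
             i = paragraphText.IndexOf(token, i + token.Length, StringComparison.Ordinal))
        {
            starts.Add(i);
        }

        // Replace from last to first so earlier indexes stay valid.
        for (int k = starts.Count - 1; k >= 0; k--)
        {
            int s = starts[k];
            bool bold = chars[s].Bold; // formatting of the token's first character
            chars.RemoveRange(s, token.Length);
            chars.Insert(s, new SimpleRun(value, bold));
        }

        // Condense: merge neighbours that share the same formatting.
        var merged = new List<SimpleRun>();
        foreach (SimpleRun c in chars)
        {
            if (c.Text.Length == 0) continue;
            if (merged.Count > 0 && merged[merged.Count - 1].Bold == c.Bold)
                merged[merged.Count - 1].Text += c.Text;
            else
                merged.Add(new SimpleRun(c.Text, c.Bold));
        }
        return merged;
    }
}
```

With the real SDK the same three steps would apply to the run elements, comparing and copying their run properties instead of a single flag.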
Instead of doing find/replace of tokens directly with OpenXML, you could use a third-party OpenXML-based templating toolkit, which is trivial to use and can pay for itself quickly.
As Scanny pointed out, OpenXML is full of nasty details that you have to master one by one. The learning curve is long and steep. If you want to become an OpenXML guru, go for it and start climbing. If you want to have time for a decent social life, there are alternatives: just pick a third-party toolkit that is based on OpenXML. I've evaluated Docentric Toolkit. It offers a template-based approach: you prepare a template, which is a file in Word format containing placeholders for data that gets merged in by the application at runtime. The templates support any formatting that MS Word supports, and you can use conditional content, tables, etc.
You can also create or change a document using a DOM approach. The final document can be .docx or .pdf.
Docentric is a licensed product, but the time you save should soon make up for the cost.
If you will be running your application on a server, don't use interop - see this link for more details: (http://support2.microsoft.com/kb/257757).
Here is some code I slapped together pretty quickly to account for tokens spread across runs in the xml. I don't know the library much, but was able to get this to work. This could use some performance enhancements too because of all the looping.
/// <summary>
/// Iterates through texts, concatenates them and looks for tokens to replace
/// </summary>
/// <param name="texts"></param>
/// <param name="tokenNameValuePairs"></param>
/// <returns>T/F whether a token was replaced. Should loop this call until it returns false.</returns>
private bool IterateTextsAndTokenReplace(IEnumerable<Text> texts, IDictionary<string, object> tokenNameValuePairs)
{
List<Text> tokenRuns = new List<Text>();
string runAggregate = String.Empty;
bool replacedAToken = false;
foreach (var run in texts)
{
if (run.Text.Contains(prefixTokenString) || runAggregate.Contains(prefixTokenString))
{
runAggregate += run.Text;
tokenRuns.Add(run);
if (run.Text.Contains(suffixTokenString))
{
if (possibleTokenRegex.IsMatch(runAggregate))
{
string possibleToken = possibleTokenRegex.Match(runAggregate).Value;
string innerToken = possibleToken.Replace(prefixTokenString, String.Empty).Replace(suffixTokenString, String.Empty);
if (tokenNameValuePairs.ContainsKey(innerToken))
{
//found token!!!
string replacementText = runAggregate.Replace(prefixTokenString + innerToken + suffixTokenString, Convert.ToString(tokenNameValuePairs[innerToken]));
Text newRun = new Text(replacementText);
run.InsertAfterSelf(newRun);
foreach (Text runToDelete in tokenRuns)
{
runToDelete.Remove();
}
replacedAToken = true;
}
}
runAggregate = String.Empty;
tokenRuns.Clear();
}
}
}
return replacedAToken;
}
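// const/static so the regex initializer may reference the delimiters (instance field initializers cannot reference each other).
const string prefixTokenString = "{";
const string suffixTokenString = "}";
// Regex.Escape keeps the delimiters literal; the hyphen goes last in the class so it is not read as a range.
static readonly Regex possibleTokenRegex = new Regex(Regex.Escape(prefixTokenString) + "[a-zA-Z0-9_-]+" + Regex.Escape(suffixTokenString));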
And some samples of calling the function:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(memoryStream, true))
{
bool replacedAToken = true;
//Keep looping over the document until no more tokens are replaced. Some tokens are spread across 'runs' and may need a second pass to be caught.
while (replacedAToken)
{
//get all the text elements
IEnumerable<Text> texts = wordDoc.MainDocumentPart.Document.Body.Descendants<Text>();
replacedAToken = this.IterateTextsAndTokenReplace(texts, tokenNameValuePairs);
}
wordDoc.MainDocumentPart.Document.Save();
foreach (FooterPart footerPart in wordDoc.MainDocumentPart.FooterParts)
{
if (footerPart != null)
{
Footer footer = footerPart.Footer;
if (footer != null)
{
replacedAToken = true;
while (replacedAToken)
{
IEnumerable<Text> footerTexts = footer.Descendants<Text>();
replacedAToken = this.IterateTextsAndTokenReplace(footerTexts, tokenNameValuePairs);
}
footer.Save();
}
}
}
foreach (HeaderPart headerPart in wordDoc.MainDocumentPart.HeaderParts)
{
if (headerPart != null)
{
Header header = headerPart.Header;
if (header != null)
{
replacedAToken = true;
while (replacedAToken)
{
IEnumerable<Text> headerTexts = header.Descendants<Text>();
replacedAToken = this.IterateTextsAndTokenReplace(headerTexts, tokenNameValuePairs);
}
header.Save();
}
}
}
}
I built an application in C# that copies documents from a source NSF to a destination NSF. The destination NSF is an empty shell, retaining all design elements, based on the source NSF. I am using Lotus Notes 8.5.3 and am not connected to a Domino Server.
I use this application to split the source NSF into smaller chunks. The goal is to create destination NSFs that can be handled effectively by our automated (eDiscovery) systems. I need to ensure that as much metadata as possible are preserved.
My existing code meets these goals, except that:
(1) I lose foldering information. After copying documents, all folders are empty.
(2) All documents are marked as Read, even if they were unread in the source.
Code C#
//Establish session
NotesSession ns = new Domino.NotesSessionClass();
ns.Initialize("");
//Open source NSF
NotesDatabase nd = ns.GetDatabase("", "test.nsf", false);
//Open destination NSF.
//Assume that all design elements of nd2 are identical to those of nd
NotesDatabase nd2 = ns.GetDatabase("", "test2.nsf", false);
//Create view that returns all documents.
NotesView nView2 = nd.GetView("$All");
nd.CreateView("All-DR", "SELECT @All", nView2, false);
NotesView nView = nd.GetView("All-DR");
//Loop through entries in the new view
NotesViewEntryCollection nvec = nView.AllEntries;
int intEntryCount = nvec.Count;
NotesViewEntry nve = nvec.GetFirstEntry();
for (int j = 1; j <= intEntryCount; j++)
{
if (j == 1)
{
nve = nvec.GetFirstEntry();
}
else
{
nve = nvec.GetNextEntry(nve);
}
//Copy document to second database.
NotesDocument ndoc = nd.GetDocumentByUNID(nve.UniversalID);
ndoc.CopyToDatabase(nd2);
}
//End loop.
//All documents are copied.
The result is that I end up with a destination NSF that has all the documents copied over. Assume that all the folders are also there. However, none of the documents are in the folders. Every document is marked as read.
How can I fix the folders and unread issue?
There is a FolderReferences property in the NotesDocument class in the back-end classes. I'm not 100% sure if that property is exposed in the COM classes and interop for C#, but if it is, you can use that along with the PutInFolder() method to solve part of your problem.
As far as read/unread marks are concerned, the critical question is whether you are concerned only about the read/unread status for yourself, or whether you are trying to preserve it for all users of the database. If you only care about unread marks for yourself, then you might be able to use the getAllUnreadDocuments() method of the NotesDatabase class -- but this requires Notes/Domino 8 or above (on the machine where your code is running), and again I'm not sure if this method (or the NotesNoteCollection class that it returns) is exposed via the COM/interop interface for C#. If it is available, then you can iterate through the collection and use the MarkUnread() method. If you care about unread marks for all users, then I'm not sure if there is a way to do it at all -- but if there is, it's going to require using calls from the Notes C API.
Another thought on moving to folders, especially if the database isn't set up for FolderReferences to work:
You can iterate over the array of NotesView objects obtained from the NotesDatabase object's Views property. Each NotesView has a property that tells you if it is a folder.
Once you know about all the folders, you can iterate within each folder and collect a list of NotesDocuments that are contained within. Then by storing this information in a dictionary you could use this as a lookup while you process each document to decide what folder(s) it needs to be placed in.
Something like this (not tested):
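object oViews = nd.Views;
object[] oarrViews = (object[])oViews;
Dictionary<string, List<string>> folderDict = new Dictionary<string, List<string>>();
// Visit every view, including the last one
for (int x = 0; x < oarrViews.Length; x++)
{
    NotesView view = (NotesView)oarrViews[x];
    if (view.IsFolder)
    {
        NotesDocument doc = view.GetFirstDocument();
        while (doc != null)
        {
            // Populate folderDict: the document's UNID is the key,
            // and the folder name is added to its list
            List<string> folders;
            if (!folderDict.TryGetValue(doc.UniversalID, out folders))
            {
                folders = new List<string>();
                folderDict[doc.UniversalID] = folders;
            }
            folders.Add(view.Name);
            // Advance, otherwise the loop never ends
            doc = view.GetNextDocument(doc);
        }
    }
}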
Then in your loop:
//Copy document to second database.
NotesDocument ndoc = nd.GetDocumentByUNID(nve.UniversalID);
NotesDocument newDoc = ndoc.CopyToDatabase(nd2);
if (folderDict.ContainsKey(nve.UniversalID)) {
foreach (var folderName in folderDict[nve.UniversalID]) {
newDoc.PutInFolder(folderName, true);
}
}