I'm developing a solution that allows people to upload a DOCX file as a template. This template is used for generating Word documents with database info.
Once a template has been uploaded, I would like to check it for errors. (I don't want my parser crashing when a template is used.)
I've seen the question about checking a signature of a Word template, but that isn't enough to validate the integrity of the file. Of course it is possible to try to unzip the file, validate the XML in there, and so on, but this is rather CPU intensive and I'd like a different approach if there is one.
Are there any solutions that are part of the Open XML SDK or other standard approaches to this? Any ideas are appreciated.
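For comparison, the manual route mentioned above (unzip and look inside) can be kept fairly small. The sketch below is only a structural pre-check, not full validation. It assumes that a DOCX is a ZIP package containing `[Content_Types].xml` and a main part at `word/document.xml` (the conventional location; the package relationships could point elsewhere), and it only checks that the main part is well-formed XML. The class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml;

public static class DocxPreCheck
{
    // Hypothetical helper: returns true when the stream looks like a DOCX package.
    public static bool LooksLikeDocx(Stream stream)
    {
        try
        {
            using (var zip = new ZipArchive(stream, ZipArchiveMode.Read, leaveOpen: true))
            {
                // Every Open XML package carries a content-types part at the root.
                if (zip.GetEntry("[Content_Types].xml") == null)
                    return false;

                // Assumption: the main document part sits at its conventional path.
                var main = zip.GetEntry("word/document.xml");
                if (main == null)
                    return false;

                // Read the part to the end; XmlReader throws on malformed XML.
                using (var entryStream = main.Open())
                using (var reader = XmlReader.Create(entryStream))
                {
                    while (reader.Read()) { }
                }
                return true;
            }
        }
        catch (InvalidDataException) { return false; } // not a readable ZIP archive
        catch (XmlException) { return false; }         // main part is not well-formed
    }
}
```

A check like this could reject obviously broken uploads cheaply, leaving schema-level validation to a dedicated validator.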
In C#, adapted from the MSDN site:
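// Requires: using System.Diagnostics; using System.Linq;
// using DocumentFormat.OpenXml.Packaging; using DocumentFormat.OpenXml.Validation;
public static bool IsDocumentValid(WordprocessingDocument mydoc)
{
    var validator = new OpenXmlValidator();
    // Materialize once so the document is not validated again when counting.
    var errors = validator.Validate(mydoc).ToList();
    foreach (ValidationErrorInfo error in errors)
    {
        Debug.WriteLine(error.Description);
    }
    return errors.Count == 0;
}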
I'm trying to implement this feature in my application.
Just like in Windows: I type into the search box and, if "File contents" is checked in the settings, the search returns every file that contains the string, no matter whether it's a text file or a PDF/Word file.
I already have a file and folder search application that works well for content search in text files and Word files. I'm using Word interop for the Word files.
I know I can use iTextSharp or some other third-party library to do this for PDF files, but that doesn't satisfy me. I want to find out how Windows does it, or whether anyone has done it in a different way. I'd prefer not to use a third-party tool, though that doesn't mean I can't; I just want to keep my application light and not weigh it down with many tools.
As far as I know, it is not possible to search PDF content without a third-party tool, program or utility installed; pdfgrep is one example. If you want to do this in a C# program, I would include a third-party library to do the job.
I made a solution for something similar in this answer: Read specific value based on label name from PDF in C#. With a bit of tweaking you can get what you are looking for. The only catch is that PdfClown targets .NET Framework; on the other hand, it is open source, free and has no limitations. If you need .NET Core, you might find some free (with limitations) or paid PDF libraries.
As you requested in the comment, here is a sample solution that finds text inside PDF pages. I have left comments inside the code:
// The found content
private List<string> _contentList;

// Search for content in a given pdf file
public bool SearchPdf(FileInfo fileInfo, string word)
{
    _contentList = new List<string>();
    ExtractPages(fileInfo.FullName);
    var content = string.Join(" ", _contentList);
    return content.Contains(word);
}

// Extract content for each page of given pdf file
private void ExtractPages(string filePath)
{
    using (var file = new File(filePath))
    {
        var document = file.Document;
        foreach (var page in document.Pages)
        {
            Extract(new ContentScanner(page));
        }
    }
}

// Extract content of pdf page and put the found result inside _contentList
private void Extract(ContentScanner level)
{
    if (level == null)
        return;

    while (level.MoveNext())
    {
        var content = level.Current;
        switch (content)
        {
            case ShowText text:
            {
                var font = level.State.Font;
                _contentList.Add(font.Decode(text.Text));
                break;
            }
            case Text _:
            case ContainerObject _:
                Extract(level.ChildLevel);
                break;
        }
    }
}
Now let's do a quick test, assuming all your invoices are in the c:\temp folder:
static void Main(string[] args)
{
    var program = new SearchPdfContent();
    DirectoryInfo d = new DirectoryInfo(@"c:\temp");
    FileInfo[] Files = d.GetFiles("*.pdf");
    var word = "Sushi";
    foreach (FileInfo file in Files)
    {
        var found = program.SearchPdf(file, word);
        if (found)
        {
            Console.WriteLine($"{file.FullName} contains word {word}");
        }
    }
}
In my case, for example, the word Sushi appears inside one invoice:
c:\temp\invoice0001.pdf contains word Sushi
All that said, this is only an example solution. You can take it from here and bring it to the next level. Enjoy your day.
I leave some links of what I have searched for:
Searching for files with specific file content
How to search contents of multiple pdf files?
Windows search PDF contents
https://superuser.com/questions/402673/how-to-search-inside-pdfs-with-windows-search
If your application is meant to search the contents of files stored as binaries in your DB, the SQL Full-Text Search feature can do this for you.
You just need to make sure that you have the required IFilters installed and create a full-text index on the table where the binary data is stored.
But if your application must access a folder in real time and search file contents, you will probably need a third-party tool, just as @maytham-ɯɐɥʇʎɐɯ said.
I am currently building an Excel file by hand using OpenXml. I'm in the process of adding the sheets, however, I have come across an issue. I have a loop that adds the names of each sheet in but once it runs and I try to open the file, I get the following message:
"We found a problem with some content in 'FileName.xlsx'. Do you want us to try to recover as much as we can? If you trust the source of this workbook, Click Yes."
I think the issue might be that I am adding the name of each sheet using a string variable. When I take it out and use something else, it works. Below is the code where I loop through and add my sheets.
// Technology Areas
foreach (DataRow dr in techAreaDS.Rows)
{
    var data = dr["TechAreaName"].ToString().Split('-');
    var techArea = data[2].TrimStart();
    var techAreaSheet = new Sheet
    {
        Id = workbookPart.GetIdOfPart(worksheetPart),
        SheetId = sheetId,
        Name = techArea
    };
    sheets.Append(techAreaSheet);
    sheetId++;
}
I've seen people mention it is an issue with cells having strings that can be converted into strings, but in this case, the string will always be a string. Any help would be appreciated.
EDIT: I've figured out the problem. The issue is the Name property has a Max Length of 31. One of my items has a 42 length, hence the error. I did find a cool set of code to validate my OpenXml. Link.
UPDATE:
Oddly enough, someone thinks this question was about finding some code to help validate what I was doing. It was not... The question is clear: why was I receiving an error when trying to name sheets. I was not asking for validation code, though I found some.
I do ask that if you wish to help, please read the question versus assume what I was asking, and if you don't know what I wish to have answered, ask...
In order to find out the issue(s) causing this error, you need to validate the generated document.
Besides using the built in validation method as described here, which doesn't show you all issues as I found out, I suggest that you download and install Microsoft's Open XML SDK 2.5 for Microsoft Office.
It contains Microsoft's Open XML SDK 2.5 Productivity Tool, which is very helpful here:
Create a copy of the damaged XLSX file and apply the fixes that Microsoft Excel suggests (suppose you have the files FileName_corrupt.xlsx and FileName_fixed.xlsx).
Then, run Microsoft's Open XML SDK 2.5 Productivity Tool, open FileName_corrupt.xlsx, select "Compare Files" and specify the 2nd file FileName_fixed.xlsx. This allows you to compare the XML structure of both files.
Let Microsoft's Open XML SDK 2.5 Productivity Tool generate C# code from both files: Open them first, then right-click on the root level and select "Reflect Code". This will create C# code which allows you to generate the same file. Save both C# code versions (i.e. FileName_corrupt.cs and FileName_fixed.cs)
Now you can compare the differences via Visual Studio: Either use
devenv.exe /diff FileName_corrupt.cs FileName_fixed.cs
to compare them, or use the batch file I've created to launch the VS compare. This is a hidden feature in Visual Studio; it lets you compare two local files that are not part of TFS.
This way you should be able to work out the differences and fix your code.
NOTE: For a first validation, I do suggest to use the validation code. Only if it still fails, use the steps above. For validation you can use
public static string ValidateOpenXmlDocument(OpenXmlPackage pXmlDoc, bool throwExceptionOnValidationFail = false)
{
    using (var docToValidate = pXmlDoc)
    {
        var validator = new DocumentFormat.OpenXml.Validation.OpenXmlValidator();
        var validationErrors = validator.Validate(docToValidate).ToList();
        var errors = new System.Text.StringBuilder();
        if (validationErrors.Any())
        {
            var errorMessage = string.Format("ValidateOpenXmlDocument: {0} validation error(s) with document", validationErrors.Count);
            errors.AppendLine(errorMessage);
            errors.AppendLine();
        }
        foreach (var error in validationErrors)
        {
            errors.AppendLine("Description: " + error.Description);
            errors.AppendLine("ErrorType: " + error.ErrorType);
            errors.AppendLine("Node: " + error.Node);
            errors.AppendLine("Path: " + error.Path.XPath);
            errors.AppendLine("Part: " + error.Part.Uri);
            if (error.RelatedNode != null)
            {
                errors.AppendLine("Related Node: " + error.RelatedNode);
                errors.AppendLine("Related Node Inner Text: " + error.RelatedNode.InnerText);
            }
            errors.AppendLine();
            errors.AppendLine("==============================");
            errors.AppendLine();
        }
        if (validationErrors.Any() && throwExceptionOnValidationFail)
        {
            throw new Exception(errors.ToString());
        }
        if (errors.Length > 0)
        {
            System.Diagnostics.Debug.WriteLine(errors.ToString());
        }
        return errors.ToString();
    }
}
along with
public static void ValidateExcelDocument(string fileName)
{
    // Read-only access (isEditable: false) is enough for validation.
    using (var xlsx = SpreadsheetDocument.Open(fileName, false))
    {
        ValidateOpenXmlDocument(xlsx);
    }
}
With a slight modification, you can easily use the code above for Microsoft Word validation too:
public static void ValidateWordDocument(string fileName)
{
    // Read-only access (isEditable: false) is enough for validation.
    using (var docx = WordprocessingDocument.Open(fileName, false))
    {
        ValidateOpenXmlDocument(docx);
    }
}
I've figured out the problem. The issue is the Name property has a Max Length of 31 characters. The text I'm trying to use sometimes exceeds that limit (one has 42 characters). I also found a pretty cool set of code to validate my Open Xml to find out what the specific issue is. Link
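Since the length limit is what broke the file, one option is to clean each name before creating the Sheet. The helper below is a sketch under stated assumptions, not part of the Open XML SDK: `SheetNames.MakeSafe` is a hypothetical name, the 31-character limit comes from the finding above, and the forbidden characters (`\ / ? * [ ] :`) reflect Excel's sheet-naming rules. It also adds a numeric suffix when truncation would produce a duplicate, because Excel rejects duplicate sheet names as well.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SheetNames
{
    private const int MaxLength = 31;
    private static readonly char[] Forbidden = { '\\', '/', '?', '*', '[', ']', ':' };

    // Returns a name that fits the length limit and is not already in 'used'.
    // The returned name is added to 'used' as a side effect.
    public static string MakeSafe(string raw, ISet<string> used)
    {
        // Replace forbidden characters and trim surrounding whitespace.
        var cleaned = new string((raw ?? string.Empty)
            .Select(c => Array.IndexOf(Forbidden, c) >= 0 ? '_' : c)
            .ToArray()).Trim();
        if (cleaned.Length == 0) cleaned = "Sheet";
        if (cleaned.Length > MaxLength) cleaned = cleaned.Substring(0, MaxLength);

        // On a clash, shorten further so the "_n" suffix still fits the limit.
        var candidate = cleaned;
        for (var n = 2; !used.Add(candidate); n++)
        {
            var suffix = "_" + n;
            var keep = Math.Min(cleaned.Length, MaxLength - suffix.Length);
            candidate = cleaned.Substring(0, keep) + suffix;
        }
        return candidate;
    }
}
```

In the loop from the question this might be used as `Name = SheetNames.MakeSafe(techArea, usedNames)`, with `usedNames` created once as `new HashSet<string>(StringComparer.OrdinalIgnoreCase)` so that names differing only by case are treated as duplicates.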
UPDATED:
I have now added the following code based on the answers below:
foreach (Word.XMLSchemaReference reference in Globals.ThisDocument.Application.ActiveDocument.XMLSchemaReferences)
{
    if (reference.NamespaceURI.Contains("ActionsPane"))
    {
        reference.Delete();
    }
}
This gives me no errors at design time, but the user still gets the message described in the original question about choosing an XML expansion pack. So the original problem hasn't been solved.
ORIGINAL QUESTION:
Using Visual Studio 2013, I have created a Word document-level project which has an action pane. Everything works well. The only problem arises when someone uses this document's action pane to insert text into the document and then saves it. The next time that saved document is opened, the user gets the following message:
One or more XML expansion packs are available for this file.
Choose one from the list below.
No XML expansion pack
Microsoft Actions Pane 3
How do I stop this from happening when saved documents are opened?
You need to check the XMLSchemaReferences of the Word document to see whether any of the XML schemas has a namespace referring to the action pane and, if so, delete it.
This needs to be done before saving.
The message you get when opening the document is because it contains a schema reference to the action pane namespace.
Something like this:
foreach (XMLSchemaReference reference in wordDocument.XMLSchemaReferences)
{
    if (reference.NamespaceURI.Contains("ActionsPane"))
    {
        reference.Delete();
    }
}
where wordDocument is the actual word document you create.
If you don't have a reference to the word document and you just want to use the current document that has the focus, you can use Globals.ThisAddIn.Application.ActiveDocument instead of wordDocument in the code.
I've solved my problem using Huron's answer. Thanks, Huron.
Remove the XML schema reference from your active document.
In my case, I remove it in the after-mail-merge event:
void ThisApplication_MailMergeAfterMerge(Word.Document Doc, Word.Document DocResult)
{
    DocResult.Fields.Update();

    // remove customization
    Office.DocumentProperties properties = (Office.DocumentProperties) DocResult.CustomDocumentProperties;
    properties["_AssemblyName"].Delete();
    properties["_AssemblyLocation"].Delete();
    DocResult.RemoveDocumentInformation(Word.WdRemoveDocInfoType.wdRDIDocumentProperties);

    foreach (XMLSchemaReference reference in DocResult.XMLSchemaReferences)
    {
        if (reference.NamespaceURI.Contains("ActionsPane"))
        {
            reference.Delete();
        }
    }

    ThisApplication.Visible = true;
    ThisApplication.NormalTemplate.Saved = true;
    Doc.MailMerge.DataSource.Close();
}
I have a VSTO document level customization that performs specific functionality when opened from within our application. Basically, we open normal documents from inside of our application and I copy the content from the normal docx file into the VSTO document file which is stored inside of our database.
var app = new Microsoft.Office.Interop.Word.Application();
var docs = app.Documents;
var vstoDoc = docs.Open(vstoDocPath);
var doc = docs.Open(currentDocPath);
doc.Range().Copy();
vstoDoc.Range().PasteAndFormat(WdRecoveryType.wdFormatOriginalFormatting);
Everything works great; however, the above code leaves out certain formatting from the source document. The code below fixes these issues, but I will most likely come across more, and I could address them one by one as they appear ...
for (int i = 0; i < doc.Sections.Count; i++)
{
    var footerFont = doc.Sections[i + 1].Footers.GetEnumerator();
    var headerFont = doc.Sections[i + 1].Headers.GetEnumerator();
    var footNoteFont = doc.Footnotes.GetEnumerator();

    foreach (HeaderFooter foot in vstoDoc.Sections[i + 1].Footers)
    {
        footerFont.MoveNext();
        foot.Range.Font.Name = ((HeaderFooter)footerFont.Current).Range.Font.Name;
    }

    foreach (HeaderFooter head in vstoDoc.Sections[i + 1].Headers)
    {
        headerFont.MoveNext();
        head.Range.Font.Name = ((HeaderFooter)headerFont.Current).Range.Font.Name;
    }

    foreach (Footnote footNote in vstoDoc.Footnotes)
    {
        footNoteFont.MoveNext();
        footNote.Range.Font.Name = ((Footnote)footNoteFont.Current).Range.Font.Name;
    }
}
I need a foolproof way of copying the content of one docx file to another while preserving formatting and eliminating the risk of corrupting the document. I've tried using reflection to copy property values from one document to the other, but the code starts to look ugly and I always worry that some of the properties I set may have undesirable side effects. I've also tried unzipping the docx files, editing the XML manually and rezipping afterwards; this hasn't worked well, and I've ended up corrupting a few documents in the process.
If anyone has dealt with a similar issue in the past, please could you point me in the right direction.
Thank you for your time
This code copies and keeps source formatting.
bookmark.Range.Copy();
Document newDocument = WordInstance.Documents.Add();
newDocument.Activate();
newDocument.Application.CommandBars.ExecuteMso("PasteSourceFormatting");
A more elegant way to manage it is based on ImportFragment:
Globals.ThisAddIn.Application.ActiveDocument.Range().ImportFragment(filePath);
or, to insert at the current selection's range, you can do the following
Globals.ThisAddIn.Application.Selection.Range.ImportFragment(filePath);
where filePath is the path to the document you are copying from.
I'm trying to build an XML file from an Excel spreadsheet (no header row). This will be part of a service on a server, so I really don't want to use the MS Office PIA (Primary Interop Assembly) files. I found LinqToExcel on Google and tried this code:
var clientExcel = new ExcelQueryFactory(excelFileName);
var sourceXml = new XElement("rows",
    clientExcel.WorksheetNoHeader().Select(line => new XElement("row",
        line.Select((column, index) => new XElement("Column_" + index, column)))));
The code compiles, but at runtime I get a TargetInvocationException. I've worried at this for the better part of a day, but can't figure out where I've gone wrong.
I'd appreciate it if someone would set me straight.
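To separate the XML-building part from the Excel access, the same projection can be run over in-memory rows. This is only a sketch: `RowsToXml.Build` is a hypothetical name and plain string sequences stand in for the rows LinqToExcel would return. If this runs cleanly, the exception is more likely raised while the worksheet is being read than in the LINQ to XML code; the InnerException of the TargetInvocationException carries the underlying error.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RowsToXml
{
    // Builds <rows><row><Column_0>..</Column_0>...</row>...</rows> from plain rows.
    public static XElement Build(IEnumerable<IEnumerable<string>> rows)
    {
        return new XElement("rows",
            rows.Select(line => new XElement("row",
                line.Select((column, index) => new XElement("Column_" + index, column)))));
    }
}
```

For example, `RowsToXml.Build(new[] { new[] { "a", "b" }, new[] { "c", "d" } })` yields a `rows` element with two `row` children, each holding `Column_0` and `Column_1`.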
...if you enable the Developer menu in Excel, there's an 'export to XML' option.
In 2003 it's done very easily with Save As > .XML.
In 2007 you might need to prepare a mapping schema, but it's still better than writing code when you don't know where your data header is.
Unfortunately LinqToExcel can only read data from spreadsheets. It can not add or update spreadsheet data.
Check out the ExcelLibrary project for writing to Excel.