openXML tables not being read in MS word - c#

I'm trying to read all the tables from a word file into a list, although for some reason the count is 0 regardless of how many tables are in the file. Here's my code.
public void FindAndReplace(string DocPath)
{
using (WordprocessingDocument doc = WordprocessingDocument.Open(DocPath, true))
{
using (StreamReader reader = new StreamReader(doc.MainDocumentPart.GetStream()))
{
//Text titlePlaceholder = doc.MainDocumentPart.Document.Body.Descendants<Text>().Where((x) => x.Text == "Compliance Review By:").First();
List<Table> tables = doc.MainDocumentPart.Document.Descendants<Table>().ToList();
System.Console.WriteLine(tables.Count);
tables.Count = 0. What am I doing wrong?

If all you're trying to do is READ the tables, then there's no need to open the document for editing (which is what you're doing currently)
Set the second parameter to false in WordprocessingDocument.Open() to open for reading. This will prevent the error related to opening an entry more than once in Update mode (I assume that's what you're running into anyway).
Solution based on chatter
The real culprit here has to do with using the wrong OpenXml namespace when examining tables in the document. When looking for Descendants of type Table, the passed-in type must be OpenXml.Wordprocessing.Table, NOT OpenXml.Drawing.Table
I don't know what type of object the OpenXml.Drawing.Table is used for. I'll ask about this in a separate question.

You probably are referencing a wrong Table. This should work:
var tables = doc.MainDocumentPart.Document.Descendants<DocumentFormat.OpenXml.Wordprocessing.Table>().ToList();

Anu start had the answer in the comments. The problem was that I was using an incorrect namespace. Instead of using DocumentFormat.OpenXml.Wordprocessing.Table I was using DocumentFormat.OpenXml.Drawing.Table

Related

Adding an image to excel using openXML

I have been trying to ad an image to a cell in a worksheet using openXML. Most solutions I have found while searching use code from or similar to the code at PolymathProgrammer website:
http://polymathprogrammer.com/2009/11/30/how-to-insert-an-image-in-excel-open-xml/
When I try this I get an error regarding NonVisualDrawingProperties, like this:
even though my usings should incluse it:
What am I doing wrong? Or is there some easier code I could be using?
This is because there are several NonVisualDrawingProperties in various namespaces and it's ambiguous as to which one you want here.
You can solve this by fully qualifying the name:
var nvdp = new DocumentFormat.OpenXml.Drawing.Spreadsheet.NonVisualDrawingProperties();
You can also alias the using statement to prevent having to type quite so much:
using SPD = DocumentFormat.OpenXml.Drawing.Spreadsheet;
...
var nvdp = new SPD.NonVisualDrawingProperties();

Iterate through a Microsoft Word document to find and replace tables

I have some VBA code that iterates through a document to remove tables from a document. The following code works fine in VBA:
Set wrdDoc = ThisDocument
With wrdDoc
For Each tbl In wrdDoc.Tables
tbl.Select
Selection.Delete
Next tbl
End With
Unfortunately, I cannot easily translate this code to C#, presumably because there is a newer Range.Find method. Here are three things I tried, each failing.
First attempt (re-write of the VBA code):
foreach (var item in doc.Tables)
{
item.Delete; //NOPE! No "Delete" function.
}
I tried this:
doc = app.Documents.Open(sourceFolderAndFile); //sourceFolderAndFile opens a standard word document.
var rng = doc.Tables;
foreach(var item in rng)
{
item.Delete; //NOPE! No "Delete" function.
}
I also tried this:
doc = app.Documents.Open(sourceFolderAndFile); //sourceFolderAndFile opens a standard word document.
var rng = doc.Tables;
Range.Find.Execute(... //NOPE! No Range.Find available for the table collection.
...
Could someone please help me understand how I can use C# and Word Interop (Word 2013 and 2016) to iterate through a document, find a table, and then perform a function, like selecting it, deleting it, or replacing it?
Thanks!
It took me some time to figure this answer out. With all the code samples online, I missed the need to create an app. For posterity, here is how I resolved the problem.
Make sure you have a Using statement, like this:
using MsWord = Microsoft.Office.Interop.Word;
Open the document and then work with the new msWord reference, the range, and the table. I provide a basic example below:
//open the document.
doc = app.Documents.Open(sourceFolderAndFile, ReadOnly: true, ConfirmConversions: false);
//iterate through the tables and delete them.
foreach (MsWord.Table table in doc.Tables)
{
//select the area where the table is located and delete it.
MsWord.Range rng = table.Range;
rng.SetRange(table.Range.End, table.Range.End);
table.Delete();
}
//don't forget doc.close and app.quit to clean up memory.
You can use the Range (rng) to replace the table with other items, like text, images, etc.

Csv-files to Sql-database c#

What is the best approach to store information gathered locally in .csv-files with a C#.net sql-database? My reasons for asking is
1: The data i am to handle is massive (millions of rows in each csv). 2: The data is extremely precise since it describes measurements on a nanoscopic scale, and is therefor delicate.
My first though was to store each row of the csv in a correspondant row in the database. I did this using The DataTable.cs-class. When done, i feelt that if something goes wrong when parsing the .csv-file, i would never notice.
My second though is to upload the .csvfiles to a database in it's .csv-format and later parse the file from the database to the local enviroment when the user asks for it. If even possible in c#.net with visual stuido 2013, how could this be done in a efficient and secure manner?
I used .Net DataStreams library from csv reader in my project. It uses the SqlBulkCopy class, though it is not free.
Example:
using (CsvDataReader csvData = new CsvDataReader(path, ',', Encoding.UTF8))
{
// will read in first record as a header row and
// name columns based on the values in the header row
csvData.Settings.HasHeaders = true;
csvData.Columns.Add("nvarchar");
csvData.Columns.Add("float"); // etc.
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
{
bulkCopy.DestinationTableName = "DestinationTable";
bulkCopy.BulkCopyTimeout = 3600;
// Optionally, you can declare columnmappings using the bulkCopy.ColumnMappings property
bulkCopy.WriteToServer(csvData);
}
}
It sounds like you are simply asking whether you should store a copy of the source CSV in the database, so if there was an import error you can check to see what happened after the fact.
In my opinion, this is probably not a great idea. It immediately makes me ask, how would you know that an error had occurred? You certainly shouldn't rely on humans noticing the mistake so you must develop a way to programmatically check for errors. If you have an automated error checking method you should apply that method when the import occurs and avoid the error in the first place. Do you see the circular logic here?
Maybe I'm missing something but I don't see the benefit of storing the CSV.
You should probably use Bulk Insert. With your csv-file as a source.
But this will only work if the file is accessible from the PC that is running your SQL Server.
Here you can find a nice solution as well. To be short it looks like this:
StreamReader file = new StreamReader(bulk_data_filename);
CsvReader csv = new CsvReader(file, true,',');
SqlBulkCopy copy = new SqlBulkCopy(conn);
copy.DestinationTableName = tablename;
copy.WriteToServer(csv);

Extracting data stored tables in .docx file with C#

When I run the following code on a .docx file the child2.innerText will print out the information I am after (although without seperating text from the seperate columns of the table).
My problem is that the associated innerXml is completely incomprehensible for me. I thought that there would be an 'get cell from table' method or such. I have literally no idea how to extract columns/rows from the xml I've been given though.
I'm completely new to C#, so I might be missing something obvious.
I am using the openxml and the file is .docx.
using (WordprocessingDocument wdoc = WordprocessingDocument.Open(pathToMiniToktrapport, false))
{
var table = wdoc.MainDocumentPart.Document.Body.Elements<Table>();
foreach (var child in table)
{
foreach (var child2 in child) {
System.Console.WriteLine(child2.InnerXml);
System.Console.WriteLine(child2.InnerText);
System.Console.ReadLine();
}
}
}
Thanks!
Please go through the link in MSDN which shows the typical structure and sample code
http://msdn.microsoft.com/en-us/library/office/cc850835.aspx

How to force ADO.Net to use only the System.String DataType in the readers TableSchema

I am using an OleDbConnection to query an Excel 2007 Spreadsheet. I want force the OleDbDataReader to use only string as the column datatype.
The system is looking at the first 8 rows of data and inferring the data type to be Double. The problem is that on row 9 I have a string in that column and the OleDbDataReader is returning a Null value since it could not be cast to a Double.
I have used these connection strings:
Provider=Microsoft.ACE.OLEDB.12.0;Data Source="ExcelFile.xlsx";Persist Security Info=False;Extended Properties="Excel 12.0;IMEX=1;HDR=No"
Provider=Microsoft.Jet.OLEDB.4.0;Data Source="ExcelFile.xlsx";Persist Security Info=False;Extended Properties="Excel 8.0;HDR=No;IMEX=1"
Looking at the reader.GetSchemaTable().Rows[7].ItemArray[5], it's dataType is Double.
Row 7 in this schema correlates with the specific column in Excel I am having issues with. ItemArray[5] is its DataType column
Is it possible to create a custom TableSchema for the reader so when accessing the ExcelFiles, I can treat all cells as text instead of letting the system attempt to infer the datatype?
I found some good info at this page: Tips for reading Excel spreadsheets using ADO.NET
The main quirk about the ADO.NET interface is how datatypes are handled. (You'll notice I've been carefully avoiding the question of which datatypes are returned when reading the spreadsheet.) Are you ready for this? ADO.NET scans the first 8 rows of data, and based on that guesses the datatype for each column. Then it attempts to coerce all data from that column to that datatype, returning NULL whenever the coercion fails!
Thank you,
Keith
Here is a reduced version of my code:
using (OleDbConnection connection = new OleDbConnection(BuildConnectionString(dataMapper).ToString()))
{
connection.Open();
using (OleDbCommand cmd = new OleDbCommand())
{
cmd.Connection = connection;
cmd.CommandText = SELECT * from [Sheet1$];
using (OleDbDataReader reader = cmd.ExecuteReader())
{
using (DataTable dataTable = new DataTable("TestTable"))
{
dataTable.Load(reader);
base.SourceDataSet.Tables.Add(dataTable);
}
}
}
}
As you have discovered, OLEDB uses Jet which is limited in the manner in which it can be tweaked. If you are set on using an OleDbConnection to read from an Excel file, then you need to set the HKLM\...\Microsoft\Jet\4.0\Engines\Excel\TypeGuessRows value to zero so that the system will scan the entire resultset.
That said, if you are open to using an alternative engine to read from an Excel file, you might consider trying the ExcelDataReader. It reads all columns as strings but will let you use dataReader.Getxxx methods to get typed values. Here's a sample that fills a DataSet:
DataSet result;
const string path = #"....\Test.xlsx";
using ( var fileStream = new FileStream( path, FileMode.Open, FileAccess.Read ) )
{
using ( var excelReader = ExcelReaderFactory.CreateOpenXmlReader( fileStream ) )
{
excelReader.IsFirstRowAsColumnNames = true;
result = excelReader.AsDataSet();
}
}
Note for 64bit OS it is here:
My Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Jet\4.0\Engines\Excel
Check out the final answer on this page.
Just noticed the page you refer to says the same thing ...
Update:
The problem seems to be with the JET engine itself and not ADO. Once JET decides on the type, it sticks to it. Anything done after that has no effect; like casting the values to string in the SQL (e.g. Cstr([Column])) just results in an empty string being returned.
At this point (if there are no other answers) I'd opt for other methods: modifying the spreadsheet; modifying registry (not ideal since you will be messing with the settings for every other app the uses JET); Excel automation or a third party component that does not use JET.
If Automation option is to slow then maybe just use it to save the spreadsheet in a different format which is easier to handle.
I have faced the same issue and determined that this is something that many people commonly experience. Here are a number of solutions that have been suggested, many of which I have attempted to implement:
Add the following to your connection string(Source):
TypeGuessRows=0;ImportMixedTypes=Text
Add the following to your connection string(Source, More Discussion, Even More):
IMEX=1;HDR=NO;
Edit the following registry settings, disable "TypeGuessRows", and "ImportMixedTypes" set to "Text"(Source, Not Recommended, More Documentation):
Hkey_Local_Machine/Software/Microsoft/Jet/4.0/Engines/Excel/TypeGuessRows
Hkey_Local_Machine/Software/Microsoft/Jet/4.0/Engines/Excel/ImportMixedTypes
Consider using an alternative library for reading the excel file:
EPPlus
ExcelDataReader (also suggested be #Thomas)
OpenXml
Format all data in the source file as Text(at least the first 8 rows), though I understand that's typically impractical(Source, though this is relation to SSIS, but it's the same concepts)
Use a Schema.ini file to define the data type before importing the file, I found this in relation to using "Jet.OleDb" directly, maybe requiring you to modifying your connection string. This may only be applicable to CSV's I have not tried this approach.(Source, Related Post)
None of these have worked for me(though I believe they have worked for others). I am of the opinion expressed by #Asher that there is really no good solution to this problem. In my software I simply display an error message to the user(if any required column contain empty values) instructing them to format all columns as "Text".
Honestly, I think this book is more applicable to situation. The issue, already stated multiple times is:
"The data type at the destination is varchar but the assumed data
type of "double" nullifies any data that doesn't fit."(Source)
"But the problem is actually with the OLEDBDataReader. The problem
is that if it sees mostly numbers in a column, it assumes everything
is a number - if a row item being read is not a number, it simply
sets it to null! Ouch!"(Source)
"The problem seems to be with the JET engine itself and not ADO. Once
JET decides on the type, it sticks to it."(#Asher)
While I haven't found any of this documented in an official capacity I think that it's very clear that this is an intentional design decision and simply how the Jet Database Library works. I hesitate to call this library entirely useless because I think for many people some of these solutions do work, but so far for my project, I have come to the conclusion that this library cannot read multiple data types in a single column and is ill suited for general data retrieval.

Categories