I'm running into a peculiar issue when trying to process large excel files (300mb+) using a data reader.
The following code illustrates the way I open the excel file and iterate over the rows in sheet 'largesheet$':
const string inputFilePath = @"C:\largefile.xlsx";
const string connectionString =
"Provider=Microsoft.ACE.OLEDB.12.0;Extended Properties=\"Excel 12.0;IMEX=1;HDR=YES;\";Data Source=" +
inputFilePath;
// Initialize connection
using (var connection = new OleDbConnection(connectionString))
{
// Open connection
connection.Open();
// Configure command
var command = new OleDbCommand("largesheet$", connection) {CommandType = CommandType.TableDirect};
// Execute reader
var reader = command.ExecuteReader(); // <-- Completely loads file/sheet into memory
// Iterate results
while (reader.Read()) // Read() advances one row at a time and returns false once all rows are consumed
{
// ... process the current row ...
}
// Close connection
connection.Close();
}
In my understanding this should open the Excel file and only load each row when it is requested via the reader.Read() statement.
However, ExecuteReader() appears to do more than return an OleDbDataReader instance. Using breakpoints I noticed that this single statement takes 30+ seconds, and Windows Resource Monitor shows a steady increase in allocated memory while it executes.
Specifying the CommandBehavior parameter (e.g. SequentialAccess) of the ExecuteReader() method has no effect.
What am I doing wrong here? Are there alternative ways of processing large (excel) files?
Note: the IMEX & HDR extended properties of the connection string are intentional.
Edit: After some rational thinking I assume it is not possible to process an Excel file without buffering it one way or another. Since Excel files are basically a glorified collection of compressed XML files, it is not possible to process a worksheet without decompressing it (and keeping it in RAM or temporarily saving it to disk).
The only alternative I can think of is using Microsoft.Office.Interop.Excel. Not sure how OpenXML handles it though.
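For reference, a rough sketch of what I believe forward-only reading with the OpenXML SDK would look like (OpenXmlReader walks the sheet XML without materializing the whole DOM; picking the worksheet part by name and resolving shared strings are left out here):
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

using (var document = SpreadsheetDocument.Open(inputFilePath, false))
{
    // Assumption: the first worksheet part is the large sheet.
    var worksheetPart = document.WorkbookPart.WorksheetParts.First();

    using (var reader = OpenXmlReader.Create(worksheetPart))
    {
        while (reader.Read())
        {
            if (reader.ElementType == typeof(Row))
            {
                // One <row> element at a time; cell values may still be shared-string
                // references that need to be resolved against the SharedStringTablePart.
            }
        }
    }
}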
From MSDN: "All rows and columns of the named table or tables will be returned when you call one of the Execute methods of a Command object." (under the Remarks section). So this would appear to be the default behavior of ExecuteReader().
ExecuteReader(CommandBehavior) may give you more options, particularly when CommandBehavior is set to SequentialAccess, though you would need to handle reading at the byte level.
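A minimal sketch of that overload applied to the question's command object; whether the ACE provider actually streams the sheet under SequentialAccess is exactly what is in doubt here, so treat it as something to test rather than a confirmed fix. Under SequentialAccess, columns must be read in ordinal order and long values are pulled in chunks via GetChars/GetBytes:
using (var reader = command.ExecuteReader(CommandBehavior.SequentialAccess))
{
    var buffer = new char[4096];
    while (reader.Read())
    {
        // Stream the first column's value in chunks instead of materializing it at once
        // (GetChars only works for text columns; numeric columns would need GetValue/GetBytes).
        long offset = 0;
        long charsRead;
        while ((charsRead = reader.GetChars(0, offset, buffer, 0, buffer.Length)) > 0)
        {
            offset += charsRead;
            // ... process buffer[0..charsRead] ...
        }
    }
}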
Related
I am working on a table that has a LONGBLOB column and I need to SELECT/INSERT data.
At the moment the code to upload a file to the DB is the following:
using (connection = new MySqlConnection(connectionString))
{
string query = "INSERT INTO files(name, uploader, bin) VALUES (@fileName, @uploader, @bin)";
using (command = connection.CreateCommand())
{
command.CommandText = query;
command.Parameters.AddWithValue("#fileName", Path.GetFileName(filePath));
command.Parameters.AddWithValue("#uploader", "John Doe");
command.Parameters.AddWithValue("#bin", File.ReadAllBytes(filePath));
connection.Open();
command.ExecuteNonQuery();
connection.Close();
}
}
The problem is that the file is loaded as a whole into RAM; is there a way to stream the data instead?
The code works, it's just a matter of understanding whether it can be optimized.
P.S. I am aware that storing big files directly into the database is bad practice, but this is legacy stuff.
In the case of a JPG that will be rendered in a web page, it is better to store it as a file on the server and put the URL of that file in the database. When the page is rendered, it will fetch the images asynchronously, making the page load seem faster to the end-user. It also sidesteps the issue in your question entirely.
If the BLOB is something else, please describe it and its usage.
It may be better to chunk it into multiple pieces, especially on a busy system that includes Replication.
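A rough sketch of what chunked storage could look like, assuming a hypothetical file_chunks(file_id, seq, bin) table next to the existing files table; connectionString and filePath are the ones from the question, and reading through a FileStream keeps only one chunk in RAM at a time:
using System.IO;
using MySql.Data.MySqlClient;

using (var connection = new MySqlConnection(connectionString))
using (var fileStream = File.OpenRead(filePath))
{
    connection.Open();

    const long fileId = 42;             // placeholder: id of the parent row in `files`
    var buffer = new byte[256 * 1024];  // 256 KB pieces; keep well below max_allowed_packet
    int bytesRead, sequence = 0;

    while ((bytesRead = fileStream.Read(buffer, 0, buffer.Length)) > 0)
    {
        using (var command = connection.CreateCommand())
        {
            command.CommandText =
                "INSERT INTO file_chunks(file_id, seq, bin) VALUES (@fileId, @seq, @bin)";

            // Copy only the bytes actually read so the last (shorter) chunk is stored correctly.
            var chunk = new byte[bytesRead];
            Array.Copy(buffer, chunk, bytesRead);

            command.Parameters.AddWithValue("@fileId", fileId);
            command.Parameters.AddWithValue("@seq", sequence++);
            command.Parameters.AddWithValue("@bin", chunk);
            command.ExecuteNonQuery();
        }
    }
}
Reassembling the file is then just a matter of selecting the chunks ordered by seq and writing them to a stream.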
I have one question and I'm hoping it's an easy one for you.
(Windows Forms Application, C#, Framework 3.5, SQL Server 2008 R2)
I don't know how to open an Excel file (it is loaded as a byte[]) via OleDB.
So, what have I done:
I have uploaded an Excel (.xls(x)) file via a form and saved it in the database as varbinary(max). Now I need to read that Excel file via OleDB. I've managed to load the file from the database into a byte[] variable. How can I open a byte[] in OleDB?
When I uploaded the file for the first time (before saving it to the database), I opened it via OleDB by just passing the file path. How can I access the Excel data when it is already stored in memory as a byte[]?
If you want to read using OleDB, then you have to write the bytes to disk. For example, you could do this:
var filename = System.IO.Path.GetTempFileName();
// Assuming that fileBytes is a byte[] containing what you read from your database
System.IO.File.WriteAllBytes(filename, fileBytes);
var connection = #"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + filename + ";Extended Properties=\"Excel 12.0;HDR=YES\"";
// Do your work on excel
using (System.Data.OleDb.OleDbConnection conn = new System.Data.OleDb.OleDbConnection(connection))
{
conn.Open();
using (var cmd = conn.CreateCommand())
{
cmd.CommandText = "SELECT * FROM [Sheet1$]";
using (var rdr = cmd.ExecuteReader())
{
while (rdr.Read())
{
System.Diagnostics.Debug.WriteLine(rdr["ColumnName"]);
}
}
}
conn.Close();
}
//Cleanup
System.IO.File.Delete(filename);
If you don't want to write the file to disk, you could look into using a third party library that can read excel files from a memory stream. Tools like SpreadsheetGear or Aspose are commercial tools that can accomplish this.
I don't know anything about what library you're using to deal with your Excel data, but the first thing that comes to mind is that it almost certainly has a LoadFromFile method. It might not be called that, but that's what it does. See if it also has a LoadFromStream method, and if so, take your byte[] data and load it into a MemoryStream, and load the XLS from there.
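For example, a minimal sketch assuming a library that can open a workbook from a Stream; EPPlus is used here purely as an illustration (its ExcelPackage constructor accepts a Stream, but it only handles .xlsx), and fileBytes is the byte[] read from the database:
using System.IO;
using System.Linq;
using OfficeOpenXml;

using (var stream = new MemoryStream(fileBytes))
using (var package = new ExcelPackage(stream))
{
    var worksheet = package.Workbook.Worksheets.First();

    // Dimension is null for an empty sheet; that check is skipped here for brevity.
    for (int row = 1; row <= worksheet.Dimension.End.Row; row++)
    {
        // Read the first column of every row as text.
        System.Diagnostics.Debug.WriteLine(worksheet.Cells[row, 1].Text);
    }
}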
emp.upload_file = Path.GetFileName(file.FileName);
Here emp is your table object and file is the HttpPostedFileBase from the upload form.
I am trying to read an Excel file every 2 seconds; the file is being updated by another RTD application.
I am able to read the file via an OleDB connection, but the problem comes when I try to read it every 2 seconds: out of 10 attempts it succeeds only 4-5 times, and on the other attempts it throws an exception.
Connection String
Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\nids\shes.xlsm;Extended Properties="Excel 12.0 Macro;HDR=Yes;IMEX=1"
Code
//opening connection to excel file
using (OleDbConnection connection = new OleDbConnection(constr))//constr = connection string
{
try
{
connection.Open();
isconopen = true;
}
catch
{
dispatcherTimer2.Start();
connection.Close();
isconopen = false;
}
// If the connection is OK, query the sheet
if (isconopen == true)
{
strcon = "SELECT * FROM [" + dsheet + "]";
using (OleDbDataAdapter adapter = new OleDbDataAdapter(strcon, connection))
{
try
{
adapter.Fill(result);
isread = true;
adapter.Dispose();
connection.Close();
}
catch
{
isread = false;
dispatcherTimer2.Start();
adapter.Dispose();
connection.Close();
}
}
}
//if able to retrieve data then call some other function
if (isread == true)
{
converToCSV(0);// for further processing
}
Please help me; I have been trying to solve this for the last month.
Sadly, the OleDB driver opens the file exclusively by default, so you can't open it while it's in use by someone else, even just for reading.
Two considerations:
The other application may finish its work with the file within milliseconds, so it's worth trying again.
The driver will always open the file locked for writing (so you can't open it twice via OleDB), but it's shared for reading (so you can copy it).
That said, I suggest you first try to open it again after a short pause; if it's still in use (and you can't wait any longer) then make a copy and open that.
Let me assume you have your code in a HandleExcelFile() function:
void HandleExcelFile(string path)
{
try
{
// Function will perform actual work
HandleExcelFileCore(path);
}
catch (Exception) // Be more specific
{
Thread.Sleep(100); // Arbitrary
try
{
HandleExcelFileCore(path);
}
catch (Exception)
{
string tempPath = Path.GetTempFileName();
File.Copy(path, tempPath);
try
{
HandleExcelFileCore(tempPath);
}
finally
{
File.Delete(tempPath);
}
}
}
}
The code is a little bit ugly, so just consider it a starting point for writing your own function.
Considerations:
Retrying isn't such a bad thing, and it's a common way to solve this kind of problem. It's, for example, what the Windows shell does (and it's even more common when networks are involved).
If the application hasn't closed the file yet, you may copy (and read) old data. If you always need the most up-to-date data, you have only one choice: wait. If you can assume that unsaved data belongs to the previous time frame (T - 1, as in digital electronics where the signal is sampled on the clock edge), then just do it and live happily.
Untested solution:
I didn't try this, so you'll have to verify it yourself. Actually (I was initially wrong), you can open a read-only connection through extended properties. It's not documented whether this applies to the connection only or to both the file handle and the connection. Anyway, try changing your connection string to:
Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\nids\shes.xlsm;Extended Properties="Excel 12.0 Macro;HDR=Yes;IMEX=1;ReadOnly=true"
This just adds ReadOnly=true at the end of the Extended Properties.
Other solutions:
The only alternative solution that comes to my mind is to... manually read the Excel file (so you can open it just for reading). That said, even in this case the other application may not have written the new data yet (so you'll read the old data).
Don't use a timer at all. Change your design to use a FileSystemWatcher; you'll read the file only when notified that it has changed (see the sketch after this list).
When applicable... just don't use a shared file as an IPC mechanism! You may not be able to change the 2nd application, though, so this may not be your case.
Don't use OleDB to read Microsoft Excel files; there are many 3rd-party free libraries that don't open the file with an exclusive lock, for example this one.
Don't read data directly from the files: if a Microsoft Excel instance is always running, you can use COM interop to get notifications. Check this general article about COM add-ins, and this one to see how you can attach your C# application with Office Interop to an existing Microsoft Excel instance.
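A small sketch of the FileSystemWatcher idea mentioned above; the path and file name are taken from the question's connection string, and HandleExcelFile is the retry wrapper shown earlier:
using System.IO;

var watcher = new FileSystemWatcher(@"C:\nids")
{
    Filter = "shes.xlsm",
    NotifyFilter = NotifyFilters.LastWrite
};

watcher.Changed += (sender, e) =>
{
    // The file is often still locked right after the event fires,
    // so the retry/copy logic above is still useful here.
    HandleExcelFile(e.FullPath);
};

watcher.EnableRaisingEvents = true;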
I'm not sure why you want to read the Excel file the way you are doing it.
You can try LinqToExcel for reading Excel files; it's a nice little library. If you also need to create Excel files, try the EPPlus library. I have personally found these libraries really effective when working with Excel.
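A quick sketch of what reading with LinqToExcel might look like; the sheet name "Sheet1" and the column header "SomeColumn" are placeholders, and the file path is the one from the question:
var excel = new LinqToExcel.ExcelQueryFactory(@"C:\nids\shes.xlsm");

// Each row is indexable by the header names taken from the first row of the sheet.
var rows = from row in excel.Worksheet("Sheet1")
           select row;

foreach (var row in rows)
{
    Console.WriteLine(row["SomeColumn"]);
}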
I had a similar problem. The fixes below worked for me:
1) Don't hold your connection to the sheet open. Instead, open the connection, read the data, and close the connection immediately.
2) If you are using managed code in an unmanaged application, consider using an object of a managed type instead of a pointer (via gcnew) and use stack semantics to make sure the memory is cleaned up when the object goes out of scope.
How do I upload an Excel sheet using ASP.NET and determine the structure of the columns in the sheet, so that I can use SqlBulkCopy to upload it to a table with a similar structure?
Any answers would be appreciated.
Thanks in advance.
I assume you know how to do the uploading part, so I concentrate on the Excel part.
There are a bunch of 3rd-party tools for reading Excel files in .NET, which in my experience are way more flexible than the capabilities .NET has out of the box. However, here's one way you can do it:
DbProviderFactory factory = DbProviderFactories.GetFactory("System.Data.OleDb");
using (DbConnection connection = factory.CreateConnection())
{
connection.ConnectionString = @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\MyExcel.xls;Extended Properties=""Excel 8.0;HDR=Yes;IMEX=1""";
connection.Open();
using (DbCommand command = connection.CreateCommand())
{
command.CommandText = "SELECT * FROM [Sheet1$]";
using (DbDataReader dr = command.ExecuteReader())
{
while (dr.Read())
{
/* read data here */
}
}
}
}
Keep in mind:
The Jet OLE DB provider reads a registry key to determine how many rows are to be read to guess the type of the source column. The registry setting is: HKLM\Software\Microsoft\Jet\4.0\Engines\Excel\TypeGuessRows. By default, the value for this key is 8. Hence, the provider scans the first 8 rows of the source data to determine the data types for the columns (see http://support.microsoft.com/kb/281517) The valid range of values for the TypeGuessRows key is 0 to 16. However, if the value is 0, the number of source rows scanned is 16384.
On 64-bit systems the Microsoft.Jet.OLEDB.4.0 driver is currently not supported.
For more info on the parameters used in the connection string see here: http://www.connectionstrings.com/excel
"HDR=Yes" in the connection string indicates that the provider will not include the first row of the cell range (which may be a header row) in the RecordSet. So if the header row gives you information that you need to build the sqlbulkcopy commands you should set it to "HDR=No".
I have a DBF file and a index file.
I want to read the index file and search for records that satisfy some condition.
(For example: find records whose StudentName begins with "A", using Student.DBF and StudentName.idx.)
How do I do this programmatically?
It would be easiest to query via an OleDB connection:
using System.Data.OleDb;
using System.Data;
OleDbConnection oConn = new OleDbConnection("Provider=VFPOLEDB.1;Data Source=C:\\PathToYourDataDirectory");
OleDbCommand oCmd = new OleDbCommand();
oCmd.Connection = oConn;
oCmd.Connection.Open();
oCmd.CommandText = "select * from SomeTable where LEFT(StudentName,1) = 'A'";
// Create an OleDBAdapter to pull data down
// based on the pre-built SQL command and parameters
OleDbDataAdapter oDA = new OleDbDataAdapter(oCmd);
DataTable YourResults = new DataTable();
oDA.Fill(YourResults);
oConn.Close();
// then you can scan through the records to get whatever
String EachField = "";
foreach( DataRow oRec in YourResults.Rows )
{
EachField = oRec["StudentName"].ToString();
// but now, you have ALL fields in the table record available for you
}
I don't have the code off the top of my head, but if you do not want to use ODBC, then you should look into reading ESRI shape files; they consist of 3 parts (or more): a .DBF (what you are looking for), a .PRJ file, and a .SHP file. It could take some work, but you should be able to dig out the code. Take a look at SharpMap on CodePlex. It's not a simple task to read a DBF without ODBC, but it can be done, and there is a lot of code out there for doing this. You have to deal with big-endian vs little-endian values, and a range of file versions as well.
If you go here you will find code to read a DBF file. Specifically, you would be interested in the public void ReadAttributes(Stream stream) method.