Improve performance of storing millions of pictures into database - c#

I have millions of pictures (each picture around 7Kb) located in a folder temp (under Windows Server 2012) and I want to store them in a SQL Server database.
What I am doing so far is:
Searching for files using: foreach (var file in directory.EnumerateFiles())
Reading each file as a binary data: byte[] data = System.IO.File.ReadAllBytes("C:\\temp\\" + file.Name);
Saving each file's binary data using SqlCommand:
using (SqlCommand savecmd = new SqlCommand("UPDATE myTable set downloaded=1,imagecontent=@imagebinary,insertdate='" + DateTime.Now.ToShortDateString() + "' where imagename='" + file.Name.Replace(".jpg", "") + "'", connection))
{
savecmd.Parameters.Add("@imagebinary", SqlDbType.VarBinary, -1).Value = data;
savecmd.ExecuteNonQuery();
}
Each picture inserted successfully is deleted from the temp folder.
Fetching a single file and storing it in the database does not take long on its own, because myTable has a clustered index on imagename.
But when we talk about millions and millions of files, it takes a huge amount of time to complete the whole operation.
Is there a way to improve this? For example, instead of storing the files one by one, store ten at a time, or a thousand at a time? Or use threads? What is the best suggestion for this kind of problem?

You should think about indexing your image storage by an identifier, not the big nvarchar() field you use for the image name ("name.jpg").
It is much faster to search by an indexed ID.
So I would suggest splitting your table in two.
The first table holds a unique primary ID (indexed) and the image binary.
The second table holds the foreign-key ID reference, insertdate, downloaded, and the image name (PK if needed, and indexed).
By using views or stored procedures, you can then still insert/update via a single call to the DB, but read entries by just looking up the picture by ID directly in the first table.
To know which ID to request, you can cache the IDs in memory (and load them from table 2 at startup or so).
This should speed up reading the pictures.
If your main problem is bulk inserting and updating all the pictures, you should consider using a user-defined table type and bulk-merging the data into the DB (sketch below):
https://msdn.microsoft.com/en-us/library/bb675163(v=vs.110).aspx
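A minimal sketch of that idea, assuming a user-defined table type such as dbo.ImageBatch (ImageName nvarchar(250), ImageContent varbinary(max)) has already been created on the server; the type name, column names and batching are placeholders, not from the question:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.IO;

static class ImageBatchSaver
{
    // Saves one batch of files with a single round trip instead of one UPDATE per file.
    public static void SaveBatch(SqlConnection connection, IEnumerable<FileInfo> batch)
    {
        // Build a DataTable matching the assumed table type dbo.ImageBatch.
        var table = new DataTable();
        table.Columns.Add("ImageName", typeof(string));
        table.Columns.Add("ImageContent", typeof(byte[]));

        foreach (var file in batch)
        {
            table.Rows.Add(Path.GetFileNameWithoutExtension(file.Name), File.ReadAllBytes(file.FullName));
        }

        // One set-based UPDATE for the whole batch; @batch is a table-valued parameter.
        const string sql =
            "UPDATE t SET t.downloaded = 1, t.imagecontent = b.ImageContent, t.insertdate = GETDATE() " +
            "FROM myTable t JOIN @batch b ON b.ImageName = t.imagename;";

        using (var cmd = new SqlCommand(sql, connection))
        {
            var p = cmd.Parameters.AddWithValue("@batch", table);
            p.SqlDbType = SqlDbType.Structured;
            p.TypeName = "dbo.ImageBatch";   // assumed name of the user-defined table type
            cmd.ExecuteNonQuery();
        }
    }
}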
If you can switch your logic to just inserting pictures, not updating them, you could use the .NET class SqlBulkCopy to speed things up.
Hope this helps,
Greetings

It sounds like your issue isn't the database, but the file I/O of finding the files themselves for deletion. I'd suggest splitting the temp folder into multiple smaller folders. If there's good distribution across the alphabet, you could have a directory for each letter (and digits, if there are some of those as well) and put the files into the directory that matches their first letter. This would make finding and deleting the files much faster. This could even be extended to a few hundred folders by using the first 3 characters of the filename, which would help significantly with millions of files. A sketch of the idea follows.
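A rough sketch of that bucketing, assuming the bucket is the first character of the file name and C:\temp is the folder in question (prefix length and paths are yours to adjust):

using System.IO;

class TempFolderSplitter
{
    static void Main()
    {
        var root = @"C:\temp";

        // Materialize the file list first, since we move files while we iterate.
        foreach (var path in Directory.GetFiles(root, "*.jpg"))
        {
            var name = Path.GetFileName(path);

            // Bucket by the first character; extend to 2-3 characters for more, smaller folders.
            var bucket = char.ToUpperInvariant(name[0]).ToString();
            var targetDir = Path.Combine(root, bucket);

            Directory.CreateDirectory(targetDir);            // no-op if it already exists
            File.Move(path, Path.Combine(targetDir, name));  // throws if a duplicate name exists
        }
    }
}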

Related

I need to create a new C# ETL process with large .CSV files to add/update records in SQL tables

I need to bring a number of .CSV files into unique keyed SQL tables (table names and column names match from source to target). I started looking at libs like Cinchoo-ETL, but I need to do an "upsert", meaning update if the record is present and insert if it's not. I'm not sure if Cinchoo-ETL or some other lib has this feature built in.
For example, let's say the SQL Server Customer table has some records in it, with Cust# as the primary key:
Cust# Name
1 Bob
2 Jack
The CSV file looks something like this:
Cust#,Name
2,Jill
3,Roger
When the ETL program runs, it needs to update Cust# 2 from Jack to Jill and insert a new Cust# 3 record for Roger.
Speed and reusability are important as there will be 80 or so different tables, some of which can have several million records in them.
Any ideas for a fast easy way to do this? Keep in mind I'm not a daily developer so examples would be awesome.
Thanks!
You are describing something that can be done with a tool I developed. It's called Fuzible (www.fuzible-app.com): in Synchronization mode, it allows you to choose the behavior of the target table (allow INSERT, UPDATE, DELETE), and your source can be any CSV file and your target any database.
You can contact me from the website if you need a how-to.
The software is free :)
What you have to do is create a job with your CSV path as the source connection and your database as the target connection.
Choose the "Synchronization" mode, which, unlike the "Replication" mode, compares source and target data.
Then you can write as many queries as you want (one for each CSV file), like this:
MyOutputTable:SELECT * FROM MyCSVFile.CSV
No need to write more complex queries if the CSV and the database table share the same schema (same columns).
The software should be able to do the rest :) It updates rows that need to be updated and creates new rows if required.
This is what I did in a recent SSIS job: I load the data into a temp table and just use a regular SQL query to perform the comparison. This may be cumbersome on tables with lots of fields.
-- SEE DIFFERENCES FOR YOUR AMUSEMENT
SELECT *
FROM Accounts a
INNER JOIN
DI_Accounts da
ON a.CustomerNumber = da.CustomerNumber AND (
a.FirstName <> da.FirstName
)
-- UPDATE BASED ON DIFFERENCES
UPDATE a
SET
a.FirstName = da.FirstName
FROM Accounts a
INNER JOIN
DI_Accounts da
ON a.CustomerNumber = da.CustomerNumber AND (
a.FirstName <> da.FirstName
)
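For completeness, getting the CSV rows into that staging table from C# is usually done with SqlBulkCopy before running the comparison queries above; a minimal sketch, assuming the CSV has already been parsed into a DataTable and a DI_Accounts staging table with matching columns exists (names are placeholders):

using System.Data;
using System.Data.SqlClient;

static class StagingLoader
{
    // Bulk-loads parsed CSV rows into the staging table; the comparison/UPDATE
    // queries above can then be run as a second step.
    public static void LoadStaging(string connectionString, DataTable csvRows)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            using (var bulk = new SqlBulkCopy(connection))
            {
                bulk.DestinationTableName = "dbo.DI_Accounts"; // assumed staging table name
                bulk.BatchSize = 10000;                        // tune for your data volume
                bulk.WriteToServer(csvRows);                   // column names/order must match the table
            }
        }
    }
}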
I would recommend that you take a look at the NuGet package ETLBox and the necessary extension packages for CSV & SQL Server (ETLBox.Csv + ETLBox.SqlServer).
This would allow you to write code like this:
//Create the components
CsvSource source = new CsvSource("file1.csv");
SqlConnectionManager conn = new SqlConnectionManager("..connection_string_here..");
DbMerge dest = new DbMerge(conn, "DestinationTableName");
dest.MergeProperties.IdColumns.Add(new IdColumn() { IdPropertyName = "Cust#" });
dest.MergeMode = MergeMode.Full; //To create the deletes
dest.CacheMode = CacheMode.Partial; //Enable for bigger data sets
//Linking
source.LinkTo(dest);
//Execute the data flow
Network.Execute(source);
This code snippet would do the corresponding inserts/updates & deletes in the database table for one file. Make sure that the header names match the column names in your database table exactly (case-sensitive). For bigger data sets you need to enable the partial cache to avoid having all data loaded into memory.
It will use dynamic objects under the hood (ExpandoObject). You can find more information about the merge and the tool on the website (www.etlbox.net)
The only downside is that ETLBox is not open source. But the package allows you to work with data sets of up to 10,000 rows to check whether it suits your needs.

C# Winforms Fastest Way To Query MS Access

This may be a dumb question, but I wanted to be sure. I am creating a WinForms app and using a C# OleDbConnection to connect to an MS Access database. Right now, I am using "SELECT * FROM table_name" and looping through each row to see if it is the row with the criteria I want, then breaking out of the loop if it is. I wonder if the performance would be improved if I used something like "SELECT * FROM table_name WHERE id=something", so basically used a WHERE clause instead of looping through every row?
The best way to validate the performance of anything is to test. Otherwise, a lot of assumptions are made about what is the best versus the reality of performance.
With that said, 100% of the time using a WHERE clause will be better than retrieving the data and then filtering via a loop. This is for a few different reasons, but ultimately you are filtering the data on a column before retrieving all of the columns, versus retrieving all of the columns and then filtering out the data. Relational data should be dealt with according to set logic, which is how a WHERE clause works, according to the data set. The loop is not set logic and compares each individual row, expensively, discarding those that don’t meet the criteria.
Don’t take my word for it though. Try it out. Especially try it out when your app has a lot of data in the table.
Yes, of course.
Say you have an Access database file shared on a folder, and you deploy your .NET desktop application to each workstation.
And furthermore, say the table has 1 million rows.
If you do this:
SELECT * from tblInvoice WHERE InvoiceNumber = 123245
Then ONLY one row is pulled down the network pipe, and this holds true EVEN if the table has 1 million rows. Traversing and pulling 1 million rows takes a HUGE amount of time, but if you add criteria to your select, then in this case it is about 1 million times faster to pull one row as opposed to the whole table.
And if this is multi-user? Then again, even over a network, ONLY the one record that meets your criteria will be pulled. The only requirement for this "one row pull" over the network is that the Access data engine has a usable index on that criterion. By default the PK column (ID) always has that index, so no worries there. But if, as per above, we are pulling invoice numbers from a table, then an index on that column (InvoiceNumber) is required for the data engine to pull only one row. If no index can be used, then behind the scenes all rows are pulled until a match occurs, and over a network this means significant amounts of data will be pulled without that index (or, if local, pulled from the file on disk).
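As a concrete illustration of the WHERE approach from C# (OleDb uses positional ? placeholders; the file path, table and column names below are assumptions):

using System;
using System.Data.OleDb;

class InvoiceLookup
{
    static void Main()
    {
        var connectionString =
            @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\data\MyDatabase.accdb;";

        using (var connection = new OleDbConnection(connectionString))
        using (var command = new OleDbCommand(
            "SELECT * FROM tblInvoice WHERE InvoiceNumber = ?", connection))
        {
            // OleDb parameters are positional; the name is only a label.
            command.Parameters.AddWithValue("@InvoiceNumber", 123245);

            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Only the matching row(s) come back over the wire.
                    Console.WriteLine(reader["InvoiceNumber"]);
                }
            }
        }
    }
}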

Efficiently iterating and updating large amounts of data from a database

I have a table in SQL Server that is storing files in binary format. Each row is on average ~3MB and there are tens of thousands of rows. What I'd like to do (since I must keep these tables around), is query each row, then run some compression on the binary data, and then re-insert the data (by updating each row).
My current naive implementation simply does something similar to this (using Dapper):
var files = await con.QueryAsync<MyClass>("SELECT ID, Content FROM Files");
foreach (var file in files)
{
    // ... compress file.Content here
    await con.ExecuteAsync("UPDATE Files SET Content = @NewContent WHERE ID = @ID", { ... });
}
Obviously this is very inefficient because it first loads all files into memory, etc. I was hoping I could somehow do the query/update in "batches", and ideally I'd like to be able to run each batch asynchronously (if that's even possible).
Any suggestions would be appreciated (this is on SQL Server, BTW).
The entire operation could be done on the DB instance, without moving data over the network to the application and back, using the built-in function COMPRESS:
This function compresses the input expression, using the GZIP algorithm. The function returns a byte array of type varbinary(max).
UPDATE Files
SET Content = COMPRESS(Content)
WHERE ID IN (range); -- for example 1k rows per batch
If you are using a SQL Server version older than 2016, or you need a "custom" compression algorithm, you could use a user-defined CLR function.
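If you drive it from the application anyway (e.g. with Dapper, as in the question), a rough sketch of running that UPDATE in ID-range batches might look like this; it assumes ID is an increasing integer key, and the batch size is arbitrary:

using System.Data.SqlClient;
using System.Threading.Tasks;
using Dapper;

static class FileCompressor
{
    // Runs the server-side COMPRESS in batches so no row content crosses the network.
    // Note: only run this once, or already-compressed rows will be compressed again.
    public static async Task CompressAllAsync(string connectionString)
    {
        const int batchSize = 1000;

        using (var con = new SqlConnection(connectionString))
        {
            var maxId = await con.ExecuteScalarAsync<int?>("SELECT MAX(ID) FROM Files") ?? 0;

            for (var start = 0; start <= maxId; start += batchSize)
            {
                await con.ExecuteAsync(
                    "UPDATE Files SET Content = COMPRESS(Content) WHERE ID >= @From AND ID < @To",
                    new { From = start, To = start + batchSize });
            }
        }
    }
}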

Optimizing SDF filesize

I recently started learning LINQ and SQL. As a small project I'm writing a dictionary application for Windows Phone. The project is split into two applications. One application (that currently runs on my PC) generates an SDF file on my PC. The second app runs on my Windows Phone and searches the database. However, I would like to optimize the data usage. The raw entries of the dictionary are written in a TXT file with a file size of around 39MB. The file has the following layout:
germanWord \tab englishWord \tab group
germanWord \tab englishWord \tab group
The file is parsed into a SDF database with the following tables.
Table Word with columns _version (rowversion), Id (int IDENTITY), Word (nvarchar(250)), Language (int)
This table contains every single word in the file. The language is a flag from my code that I used in case I want to add more languages later. A word-language pair is unique.
Table Group with columns _version (rowversion), GroupId (int IDENTITY), Caption (nvarchar(250))
This table contains the different groups. Every group is present one time.
Table Entry with columns _version (rowversion), EntryId (int IDENTITY), WordOneId (int), WordTwoId(int), GroupId(int)
This table links translations together. WordOneId and WordTwoId are foreign keys to a row in the Word Table, they contain the id of a row. GroupId defines the group the words belong to.
I chose this layout to reduce the data footprint. The raw text file contains some German (or English) words multiple times. There are around 60 groups that repeat themselves. Programmatically I reduce the word count from around 1,800,000 to around 1,100,000. There are around 50 rows in the Group table. Despite the reduced number of words, the SDF is around 80MB in file size. That's more than twice the size of the raw data. Another thing is that, in order to speed up the searching of translations, I plan to index the Word column of the Word table. By adding this index the file grows to over 130MB.
How can it be that the SDF with ~60% of the original data is twice as large?
Is there a way to optimize the filesize?
The database file must contain all of the data from your raw file in addition to row metadata. It will also contain the strings stored according to the datatypes specified; I believe your option here is NVARCHAR, which uses two bytes per letter. Combining these considerations, it would not surprise me that a database file is over twice as large as a text file of the same data using the ISO-Latin-1 character set.

Saving image to the database

I have a simple entity which now needs a profile image. What is the proper way to do this? It is a 1-to-1 relationship: one image is related to only one entity and vice versa. The image should be uploaded through a web form together with inserting the related entity.
If anyone can point me in the right direction on how to persist images to the DB along with the related entity, that would be great.
Just a side comment: I think it is not a good idea to store images in the DB.
In general it is not a good idea to store images in the DB, as DBs are designed to store text, not big binary chunks. It is much better to store the paths of the images and keep the images in a folder. If you want to be sure of the 1-to-1 relationship, name the image with the ID of the entity (1323.jpg).
If you go with image paths you should follow some guidelines (in general, code defensively):
On upload of an image, check that the image is valid (even do a binary check of the image header).
Don't allow overwriting an existing image in the case of an INSERT of a new entity.
Name images after the primary key (1.jpg, 2.jpg).
On load of an image, don't assume that the image is going to be there.
Do not allow (if possible) manual interaction with the images (no remoting into the machine and copying images from one place to another). Manual interaction can cause inconsistencies.
But I assume that for some reason you have to store them in the DB. So in order to achieve what you want:
DB design
Create a binary column (binary or varbinary) in your table.
It is better if you create it in a different table with a 1-1 relationship. The idea is to avoid loading the image when hydrating the entity; use a lazy-load approach to load the image only when you need it.
You have to avoid loading images when you make a big select (for example, if you want to load all your entities into a combo, avoid SELECT * FROM whatever), as it will load thousands of images for nothing. As I said, this can be done by having the images in a different table, by loading only the proper columns in the SELECT, or by lazy loading. (Or, even better, by NOT having images in the DB, only paths.)
C# Code
Use BinaryReader to read it.
Use a byte array to store it.
Check this link for a code example (a sketch also follows below): http://www.codeproject.com/Articles/21208/Store-or-Save-images-in-SQL-Server
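A minimal sketch of that read-and-store step, assuming an Images table with an Id column and a varbinary(max) Data column (table and column names are placeholders):

using System.Data;
using System.Data.SqlClient;
using System.IO;

class ImageSaver
{
    static void Save(string connectionString, int entityId, string imagePath)
    {
        // Read the uploaded file into a byte array (or read from the upload stream directly).
        byte[] data = File.ReadAllBytes(imagePath);

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "UPDATE Images SET Data = @data WHERE Id = @id", connection))
        {
            // -1 means varbinary(max).
            command.Parameters.Add("@data", SqlDbType.VarBinary, -1).Value = data;
            command.Parameters.AddWithValue("@id", entityId);

            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}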
The code is trivial, but why the DB?
If this is a website, why not save it to a location on disk where you can easily reference it?
Databases are optimised to store data of a known and relatively small size. Your image will most likely be more than 8KB in length (meaning it's a MAX datatype).
The image will be stored on a separate row/page from your "profile".
Personally I'd save the images in a known folder and use the ID for the image name. For profiles that don't have an image and use a standard gif or similar, keep it simple by having symlinks/hardlinks from the profile ID to the common gif.
public class Profile
{
    public int Id {get;}
    public string Name {get; private set;}
    public Image Picture {get; private set;}

    public void Save()
    {
        using (var connection = new SqlConnection("myconnectionstring"))
        using (var command = new SqlCommand("", connection))
        {
            command.CommandText =
                "UPDATE dbo.TblProfile " +
                "SET " +
                "Name = @name, " +
                "Picture = @picture " +
                "WHERE ID = @id";
            command.Parameters.AddWithValue("@name", Name);
            // Picture must be serialized to a byte[] before it can be saved as varbinary.
            command.Parameters.AddWithValue("@picture", Picture);
            command.Parameters.AddWithValue("@id", Id);

            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}
I think the following link will give you the solution:
Upload Image and Save in DB
