I recently started learning LINQ and SQL. As a small project I'm writing a dictionary application for Windows Phone. The project is split into two applications: one (which currently runs on my PC) generates an SDF file, and the second runs on my Windows Phone and searches the database. However, I would like to optimize the data usage. The raw entries of the dictionary are written in a TXT file with a file size of around 39 MB. The file has the following layout:
germanWord \tab englishWord \tab group
germanWord \tab englishWord \tab group
The file is parsed into an SDF database with the following tables.
Table Word with columns _version (rowversion), Id (int IDENTITY), Word (nvarchar(250)), Language (int)
This table contains every single word in the file. The language is a flag from my code that I used in case I want to add more languages later. A word-language pair is unique.
Table Group with columns _version (rowversion), GroupId (int IDENTITY), Caption (nvarchar(250))
This table contains the different groups. Every group is present one time.
Table Entry with columns _version (rowversion), EntryId (int IDENTITY), WordOneId (int), WordTwoId(int), GroupId(int)
This table links translations together. WordOneId and WordTwoId are foreign keys referencing rows in the Word table; GroupId defines the group the words belong to.
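For illustration, the Word table in that layout could be mapped to a LINQ to SQL entity roughly like this (a sketch only, using the System.Data.Linq.Mapping attributes; this is not my actual code):

// Namespaces assumed: System.Data.Linq, System.Data.Linq.Mapping.
[Table(Name = "Word")]
public class Word
{
    [Column(Name = "_version", IsVersion = true)]
    private Binary _version;                         // rowversion

    [Column(IsPrimaryKey = true, IsDbGenerated = true)]
    public int Id { get; set; }                      // int IDENTITY

    [Column(Name = "Word", DbType = "NVarChar(250)")]
    public string Text { get; set; }                 // the word itself

    [Column]
    public int Language { get; set; }                // language flag
}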
I chose this layout to reduce the data footprint. The raw text file contains some German (or English) words multiple times, and there are around 60 groups that repeat themselves. Programmatically I reduce the word count from around 1,800,000 to around 1,100,000, and there are around 50 rows in the Group table. Despite the reduced number of words the SDF is around 80 MB in file size - more than twice the size of the raw data. Another issue: in order to speed up translation lookups I plan to index the Word column of the Word table, and adding this index grows the file to over 130 MB.
How can it be that the SDF with ~60% of the original data is twice as large?
Is there a way to optimize the file size?
The database file must contain all of the data from your raw file, plus row metadata. It also stores the strings according to the declared data types - in your case NVARCHAR, which uses two bytes per character. As a rough calculation, 39 MB of single-byte ISO Latin-1 text stored as two-byte NVARCHAR already comes to about 78 MB before any row metadata or index overhead, so it does not surprise me that the database file is more than twice as large as a text file of the same data.
Related
I have millions of pictures (each around 7 KB) located in a temp folder (under Windows Server 2012) and I want to store them in a SQL Server database.
What I am doing so far is:
Searching for files using: foreach (var file in directory.EnumerateFiles())
Reading each file as binary data: byte[] data = System.IO.File.ReadAllBytes("C:\\temp\\" + file.Name);
Saving each file's binary data using SqlCommand:
using (SqlCommand savecmd = new SqlCommand(
    "UPDATE myTable SET downloaded = 1, imagecontent = @imagebinary, insertdate = @insertdate WHERE imagename = @imagename",
    connection))
{
    // parameterize every value instead of concatenating strings into the SQL
    savecmd.Parameters.Add("@imagebinary", SqlDbType.VarBinary, -1).Value = data;
    savecmd.Parameters.Add("@insertdate", SqlDbType.DateTime).Value = DateTime.Today;
    savecmd.Parameters.Add("@imagename", SqlDbType.NVarChar, 255).Value = file.Name.Replace(".jpg", "");
    savecmd.ExecuteNonQuery();
}
Each picture inserted successfully is deleted from the temp folder.
Fetching a single file and storing it in the database does not take much time, because myTable has a clustered index on imagename.
But when we talk about millions and millions of files, it takes a huge amount of time to complete this whole operation.
Is there a way to improve on this way of working? For example, instead of storing file by file, store ten by ten, or thousand by thousand? Or using threads? What is the best suggestion for this kind of problem?
You should think about keying your image storage by a numeric identifier, not the wide nvarchar() field you use for the image name ("name.jpg").
It is much faster to search by an indexed ID.
So I would suggest splitting your table into two tables.
The first one holds a unique primary ID (indexed) and the image binary.
The second table holds the foreign-key ID reference, insertdate, downloaded, and the image name (as primary key if needed, and indexed).
By integrating views or stored procedures, you can then still insert/update via a single call to the DB, but read entries by just looking up the picture by ID directly on the first table.
To know which ID to query, you can cache the IDs in memory (loading them from the second table at startup, for example).
This should speed up reading the pictures.
If your main problem is bulk inserting and updating all the pictures, you should consider using a user-defined table type and bulk merging the data into the DB:
https://msdn.microsoft.com/en-us/library/bb675163(v=vs.110).aspx
If you can switch your logic to just inserting pictures rather than updating, you could use the .NET class SqlBulkCopy to speed things up.
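For illustration, a minimal SqlBulkCopy sketch (this is not code from the question; the staging table name and column names are assumptions, and with millions of images you would flush the DataTable to the server in batches of a few thousand rows instead of buffering everything in memory):

// Namespaces assumed: System, System.Data, System.Data.SqlClient, System.IO.
var table = new DataTable();
table.Columns.Add("imagename", typeof(string));
table.Columns.Add("imagecontent", typeof(byte[]));
table.Columns.Add("insertdate", typeof(DateTime));

foreach (var file in directory.EnumerateFiles("*.jpg"))
{
    table.Rows.Add(file.Name.Replace(".jpg", ""), File.ReadAllBytes(file.FullName), DateTime.Today);
}

using (var bulk = new SqlBulkCopy(connection))
{
    bulk.DestinationTableName = "myTableStaging";   // assumed staging table with matching columns
    bulk.ColumnMappings.Add("imagename", "imagename");
    bulk.ColumnMappings.Add("imagecontent", "imagecontent");
    bulk.ColumnMappings.Add("insertdate", "insertdate");
    bulk.WriteToServer(table);                      // one round trip for the whole batch
}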
Hope this helps.
It sounds like your issue isn't the database, but the file I/O of finding the files themselves for deletion. I'd suggest splitting the temp folder into multiple smaller directories. If there's a good distribution across the alphabet, you could have a directory for each letter (and for digits, if there are some of those as well) and put each file into the directory that matches its first letter. This would make finding and deleting the files much faster. This could even be extended to a few hundred directories keyed on the first 3 characters of the filename, which would help significantly with millions of files.
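A minimal sketch of that bucketing idea, assuming single-character buckets under the temp folder from the question (the bucket depth and naming are just one possible choice):

// Namespaces assumed: System.IO.
// Move each file into a subfolder named after its first character,
// e.g. C:\temp\a\apple.jpg, C:\temp\1\123.jpg.
string root = @"C:\temp";
foreach (var path in Directory.EnumerateFiles(root, "*.jpg"))
{
    string name = Path.GetFileName(path);
    string bucket = Path.Combine(root, char.ToLowerInvariant(name[0]).ToString());
    Directory.CreateDirectory(bucket);             // no-op if the folder already exists
    File.Move(path, Path.Combine(bucket, name));   // later lookups/deletes only touch one bucket
}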
In order to better define my problem I'll explain in steps:
I need to consolidate selected data from 4 databases into one.
Each database logs data obtained from an industrial system (sensors and switches, mainly).
DBs are in .accdb format with encryption
Each source database has 3 columns:
timestamp (datetime format)
point_id (Variable name - text format)
_VAL (Variable value - text format in two DBs, byte in the other two DBs)
Variable value is logged in one row every time it changes (1-second resolution), and all variables are logged once every 15 minutes (to get a snapshot of the system every so often). Example:
1/9/2014 1:35:54 AM - Tank_Volume - 5,763
1/9/2014 1:35:54 AM - Line_Pressure - 14,325
1/9/2014 1:35:55 AM - Tank_Volume - 5,121
1/9/2014 1:35:56 AM - Tank_Volume - 4,911
I'm logging a total of 511 variables
The output DB requirements are:
Each row must contain one second of data for all variables, sequentially and without skipping seconds
Each variable must have its own column (511 variables + 1 for timestamp), preferably with an appropriate format to save on space (output DB must be sent by e-mail)
If the variable value hasn't changed for the given second, it can take the last logged value for that variable
It must contain data only for a selected period of time (e.g.: from 1/8/2014 1:30:00 AM to 1/8/2014 3:45:00 AM) - I have the fields for selection in the UI
The user must be prompted to save this consolidated DB
The DB should be optimized in order to reduce its size after all data is copied to it
I know it's not too complex, but I want an opinion on the best way to deal with all this data. The source databases might be more than 1 GB each (many, many days of logs). I'll usually get only 3-4 hours of data from them into the output DB, but that's still 14,000+ rows (one per second) with 512 columns, parsed cell by cell... I imagine that's a lot to process, right?
My idea is to:
Establish connection with the 4 source DBs (they are located in one fixed directory)
Select the data to be extracted from each DB (based on the UI Start and End datetime fields) and place it in one large DataTable (SourceData) - a rough sketch of this step appears after this list
Once SourceData is populated, close connection with the source DBs
Create 3 output DataTables (OutputData) with an algorithm that parses each line from SourceData on a second-by-second basis and places it in the right row/column (based on the timestamp and point_id source columns) - and if there's no data for any given point in time, repeat the value from the previous second
Open connection to an output DB (supposedly empty), or create one, if possible
Check if there's any table there and drop them, if true
Create 3 tables to contain all cols (timestamp being the primary for all 3 tables)
Populate these tables with the data from OutputData
Optimize the tables to reduce size
Save the DB to a backup folder, prompt the user to save the DB in another place as well, and display the final file size
Clear both SourceData and the OutputData tables to free RAM
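For reference, a rough sketch of the selection step above (the ACE OLE DB provider, the file path, the password, and the table/column names are assumptions; startDateTime and endDateTime come from the UI fields):

// Namespaces assumed: System.Data, System.Data.OleDb.
var sourceData = new DataTable();
string connStr = "Provider=Microsoft.ACE.OLEDB.12.0;" +
                 @"Data Source=C:\logs\source1.accdb;" +
                 "Jet OLEDB:Database Password=yourPassword;";   // placeholder path and password

using (var conn = new OleDbConnection(connStr))
using (var cmd = new OleDbCommand(
    "SELECT [timestamp], [point_id], [_VAL] FROM LogTable " +
    "WHERE [timestamp] BETWEEN ? AND ? ORDER BY [timestamp]", conn))
{
    // OLE DB uses positional (?) parameters, bound in the order they are added.
    cmd.Parameters.AddWithValue("@start", startDateTime);
    cmd.Parameters.AddWithValue("@end", endDateTime);
    using (var adapter = new OleDbDataAdapter(cmd))
    {
        adapter.Fill(sourceData);   // the adapter opens and closes the connection itself
    }
}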
Is there a more efficient/easy way to achieve my goal? At first I was going for immediate read/write to/from the DBs, but I figured working with variables inside the executable would be a lot faster than file I/O...
Thank you all in advance!
I have a process that takes a long time and I'm seeking ways to reduce it.
The process goes like this:
A system manager writes 10-20 lines (1-3 words each) into a text box
Any empty or whitespace-only lines are removed
The system multiplies each line by 3000 different suffixes (by suffix I mean an additional 1-2 words)
Check for duplicate lines and remove them
Check for illegal chars and, in the process, for duplicates against the DB - if any are found, remove the line.
For each line:
Select a line id (for parent id) - this is a query, like: select parentid from table where name='son'
Insert the line with the parent id
As I see it, the per-line insert takes the longest time, but it is necessary for the parent id (the parent is itself one of the new lines). The table has, among other columns, an id, a name and a parent id. The code is written in C# and works with MySQL.
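For illustration, a rough sketch of the per-line pattern described above (MySql.Data is assumed as the driver; the table and column names and the line object are placeholders):

// Namespaces assumed: MySql.Data.MySqlClient.
// Two round trips per generated line: one to find the parent id, one to insert.
using (var conn = new MySqlConnection(connectionString))
{
    conn.Open();
    foreach (var line in generatedLines)
    {
        long parentId;
        using (var select = new MySqlCommand("SELECT id FROM mytable WHERE name = @name", conn))
        {
            select.Parameters.AddWithValue("@name", line.ParentName);
            parentId = Convert.ToInt64(select.ExecuteScalar());
        }
        using (var insert = new MySqlCommand("INSERT INTO mytable (name, parentid) VALUES (@name, @parentid)", conn))
        {
            insert.Parameters.AddWithValue("@name", line.Name);
            insert.Parameters.AddWithValue("@parentid", parentId);
            insert.ExecuteNonQuery();
        }
    }
}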
As a solution, I think the only option may be to convert the code into a full MySQL stored procedure, but I'm not sure how much it would help.
I am doing the task of importing an xls file into SQL Server 2008 using C#. The xls file contains 3 columns: ProductCode (alphanumeric values), Productname (string values), and Categoryids (alphanumeric values).
When I import the xls through my code it reads Productname and Categoryids, but for ProductCode it only reads the purely numeric values; it cannot read the codes that contain characters.
E.g. sample column values:
productcode
-30-sunscreen-250ml,
04 5056,
045714PC,
10-cam-bag-pouch-navy-dot,
100102
It reads 100102, but it cannot read [045714PC, 04 5056, -30-sunscreen-250ml, 10-cam-bag-pouch-navy-dot].
Please suggest any solutions.
Thanks
Excel's OLEDB driver makes assumptions about a column's data type based on the first 8 rows of data. If the majority of the first 8 rows for a given column are numeric, it assumes the entire column is numeric and then can't properly handle the alphanumeric values.
There are four solutions for this:
Sort your incoming data so the majority of the first 8 rows have alphanumeric values in that column (and in any other column with mixed numeric / alphanumeric data).
Add rows of fake data in, say, rows 2-9 that you ignore when you actually perform the import, and ensure those rows contain letters in any column that should not be strictly numeric.
Edit the REG_DWORD key called "TypeGuessRows" located at [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel] in your registry and change the 8 to a 0. This will force Excel to look through the entire sheet before guessing the column data types. However, this can hinder performance. (You can also change the value from 8 to anything between 1 and 16, but that just changes how many rows Excel looks at, and 16 may still not be enough for you.)
Add ";IMEX=1" to the Extended Properties of your connection string (see the sketch below). This will change the logic to look for at least one non-numeric value instead of looking at the majority of the values. This may then be combined with solution (1) or (2) to ensure it "sees" an alphanumeric value in the appropriate columns within the first 8 rows.
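For reference, a connection string with IMEX=1 might look like the following sketch (the Jet provider matches the registry key above; the file path and sheet name are placeholders):

// Namespaces assumed: System.Data, System.Data.OleDb.
// IMEX=1 makes the driver treat mixed-type columns (like ProductCode) as text.
string excelConnStr =
    @"Provider=Microsoft.Jet.OLEDB.4.0;" +
    @"Data Source=C:\imports\products.xls;" +
    @"Extended Properties=""Excel 8.0;HDR=YES;IMEX=1"";";

using (var conn = new OleDbConnection(excelConnStr))
using (var adapter = new OleDbDataAdapter("SELECT * FROM [Sheet1$]", conn))
{
    var products = new DataTable();
    adapter.Fill(products);   // ProductCode now arrives as text, including values like 045714PC
}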
When the program runs for the first time it just gets some fields from a source database table, say:
SELECT NUMBER, COLOR, USETYPE, ROOFMATERIALCODE FROM HOUSE; -- NUMBER is the unique key
It does some in-memory processing, say converting USETYPE and ROOFMATERIALCODE to the destination database format (by using a cross-reference table).
Then the program inserts ALL THE ROWS into the destination database:
INSERT INTO BUILDING (BUILDINGID, BUILDINGNUMBER, COLOR, BUILDINGTYPE, ROOFMAT)
VALUES (PROGRAM_GENERATED_ID, NUMBER_FROM_HOUSE, COLOR_FROM_HOUSE,
CONVERTED_USETYPE_FROM_HOUSE, CONVERTED_ROOFMATERIALCODE_FROM_HOUSE);
The above is naturally not SQL but you get the idea (the values with underscores just describe the data inserted).
On subsequent runs the program should do the same, except:
insert only the rows not found in the target database,
update only the rows whose color, usetype or roofmaterialcode has changed.
My question is:
How to implement this in efficient way?
- Do I first populate a DataSet and convert the fields to the destination format?
- If I use only one DataSet, how do I give the destination DB its BUILDING_IDs (can I add columns to an already-populated DataSet)?
- How do I efficiently check whether destination rows need a refresh (if I select them one at a time by BUILDING_NUMBER and compare all fields, it's going to be slow)?
Thanks for your answers!
-matti
If you are using Oracle, have you looked at the MERGE statement? You give the MERGE statement a matching condition; if a record matches, it performs an UPDATE, and if it doesn't match (the record isn't already in the table), it performs an INSERT. That might be helpful for what you are trying to do.
Here is the spec/example of merge.
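A rough sketch of what that MERGE could look like for the tables in the question, issued from C# with the ODP.NET managed driver (the driver choice, the BUILDING_SEQ sequence, and the variables holding the converted HOUSE values are assumptions, not part of the original answer):

// Namespaces assumed: Oracle.ManagedDataAccess.Client.
// Update the row if BUILDINGNUMBER already exists, otherwise insert it.
string mergeSql =
    "MERGE INTO BUILDING b " +
    "USING (SELECT :num AS NUM, :color AS COLOR, :btype AS BTYPE, :roof AS ROOF FROM dual) s " +
    "ON (b.BUILDINGNUMBER = s.NUM) " +
    "WHEN MATCHED THEN UPDATE SET b.COLOR = s.COLOR, b.BUILDINGTYPE = s.BTYPE, b.ROOFMAT = s.ROOF " +
    "WHEN NOT MATCHED THEN INSERT (BUILDINGID, BUILDINGNUMBER, COLOR, BUILDINGTYPE, ROOFMAT) " +
    "VALUES (BUILDING_SEQ.NEXTVAL, s.NUM, s.COLOR, s.BTYPE, s.ROOF)";

using (var cmd = new OracleCommand(mergeSql, connection))
{
    cmd.BindByName = true;   // bind by parameter name rather than by position
    cmd.Parameters.Add("num", numberFromHouse);
    cmd.Parameters.Add("color", colorFromHouse);
    cmd.Parameters.Add("btype", convertedUseType);
    cmd.Parameters.Add("roof", convertedRoofMaterialCode);
    cmd.ExecuteNonQuery();
}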