Consolidate and parse 4 databases into one - c#

In order to better define my problem I'll explain in steps:
I need to consolidate selected data from 4 databases into one.
Each database logs data obtained from an industrial system (sensors and switches, mainly).
DBs are in .accdb format with encryption
Each source database has 3 columns:
timestamp (datetime format)
point_id (Variable name - text format)
_VAL (Variable value - text format in two DBs, byte in the other two DBs)
Variable value is logged in one row every time it changes (1-second resolution), and all variables are logged once every 15 minutes (to get a snapshot of the system every so often). Example:
1/9/2014 1:35:54 AM - Tank_Volume - 5,763
1/9/2014 1:35:54 AM - Line_Pressure - 14,325
1/9/2014 1:35:55 AM - Tank_Volume - 5,121
1/9/2014 1:35:56 AM - Tank_Volume - 4,911
I'm logging a total of 511 variables
The output DB requirements are:
Each row must contain one second of data for all variables, sequentially and without skipping seconds
Each variable must have its own column (511 variables + 1 for timestamp), preferably with an appropriate format to save on space (output DB must be sent by e-mail)
If the variable value hasn't changed for the given second, it can take the last logged value for that variable
It must contain data only for a selected period of time (e.g.: from 1/8/2014 1:30:00 AM to 1/8/2014 3:45:00 AM) - I have the fields for selection in the UI
The user must be prompted to save this consolidated DB
The DB should be optimized in order to reduce its size after all data is copied to it
I know it's not overly complex, but I'd like an opinion on the best way to deal with all this data. The source databases might be more than 1 GB each (many, many days of logs). I'll usually pull only 3~4 hours of data from them into the output DB, but that's still 14,000+ rows (one per second) with 512 columns, parsed cell by cell... I imagine that's a lot to process, right?
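As a sketch of the extraction side, a parameterized time-range query keeps the transfer down to the selected window instead of the whole 1 GB file; the OLE DB provider string, the placeholder table name LogTable, and the password handling below are assumptions rather than details from the question:

using System;
using System.Data;
using System.Data.OleDb;

// Pulls only the selected time window from one encrypted .accdb into a DataTable.
static DataTable LoadWindow(string dbPath, string password, DateTime start, DateTime end)
{
    // ACE 12.0 and the "Jet OLEDB:Database Password" keyword are assumptions for an
    // encrypted .accdb; adjust to whatever provider is installed.
    string connStr = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + dbPath +
                     ";Jet OLEDB:Database Password=" + password + ";";

    var table = new DataTable("SourceData");
    using (var conn = new OleDbConnection(connStr))
    using (var cmd = new OleDbCommand(
        "SELECT [timestamp], [point_id], [_VAL] " +
        "FROM LogTable " +                       // "LogTable" is a placeholder table name
        "WHERE [timestamp] BETWEEN ? AND ? " +
        "ORDER BY [timestamp]", conn))
    {
        // OleDb parameters are positional, so add them in the order the ?s appear.
        cmd.Parameters.Add("@start", OleDbType.Date).Value = start;
        cmd.Parameters.Add("@end", OleDbType.Date).Value = end;

        conn.Open();
        using (var adapter = new OleDbDataAdapter(cmd))
            adapter.Fill(table);                 // only the selected window is loaded
    }
    return table;
}

Calling something like this once per source DB and merging the four results would produce the SourceData table described in the plan below.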
My idea is to:
Establish connection with the 4 source DBs (they are located in one fixed directory)
Select the data to be extracted from each DB (based on the UI Start and End datetime fields) and place it in one large DataTable (SourceData)
Once SourceData is populated, close connection with the source DBs
Create 3 output DataTables (OutputData) with an algorithm that parses each line from SourceData on a second-by-second basis and places it in the right row/column (based on the timestamp and point_id source columns) - and if there's no data for a given point in time, repeat the value from the previous second (see the sketch after this list)
Open connection to an output DB (supposedly empty), or create one, if possible
Check whether there are any tables there and drop them if so
Create 3 tables to contain all the columns (timestamp being the primary key for all 3 tables)
Populate these tables with the data from OutputData
Optimize the tables to reduce size
Save the DB to a backup folder, prompt the user to save the DB somewhere else as well, and display the final file size
Clear both SourceData and OutputData to free RAM
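A minimal sketch of the second-by-second pivot with forward fill referenced in the OutputData step above (pointIds would be the list of 511 variable names; column typing, the split into three output tables, and converting the text/byte _VAL values are simplified here):

using System;
using System.Collections.Generic;
using System.Data;

// Pivots the 3-column SourceData (timestamp, point_id, _VAL) into one wide table,
// one row per second, repeating the last known value when nothing was logged.
static DataTable PivotBySecond(DataTable sourceData, DateTime start, DateTime end,
                               IList<string> pointIds)
{
    var output = new DataTable("OutputData");
    output.Columns.Add("timestamp", typeof(DateTime));
    foreach (string id in pointIds)
        output.Columns.Add(id, typeof(string));

    // Group the source rows by whole second for quick lookup.
    var bySecond = new Dictionary<DateTime, List<DataRow>>();
    foreach (DataRow row in sourceData.Rows)
    {
        DateTime t = (DateTime)row["timestamp"];
        t = t.AddTicks(-(t.Ticks % TimeSpan.TicksPerSecond));   // truncate to the second
        List<DataRow> list;
        if (!bySecond.TryGetValue(t, out list))
        {
            list = new List<DataRow>();
            bySecond[t] = list;
        }
        list.Add(row);
    }

    var lastValues = new Dictionary<string, object>();   // point_id -> last logged value

    for (DateTime t = start; t <= end; t = t.AddSeconds(1))
    {
        // Apply whatever changed during this second.
        List<DataRow> changes;
        if (bySecond.TryGetValue(t, out changes))
            foreach (DataRow change in changes)
                lastValues[(string)change["point_id"]] = change["_VAL"];

        DataRow outRow = output.NewRow();
        outRow["timestamp"] = t;
        foreach (string id in pointIds)
        {
            object val;
            if (lastValues.TryGetValue(id, out val))
                outRow[id] = val;        // otherwise stays DBNull until the first log entry
        }
        output.Rows.Add(outRow);
    }
    return output;
}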
Is there a more efficient/easy way to achieve my goal? At first I was going for immediate read/write to/from the DBs, but I figured working with variables inside the executable would be a lot faster than file I/O...
Thank you all in advance!

Related

C# Winforms Fastest Way To Query MS Access

This may be a dumb question, but I wanted to be sure. I am creating a Winforms app and using a C# OleDbConnection to connect to an MS Access database. Right now, I am using "SELECT * FROM table_name" and looping through each row to see if it is the row with the criteria I want, then breaking out of the loop if it is. I wonder if the performance would be improved if I used something like "SELECT * FROM table_name WHERE id=something", so basically a "WHERE" statement instead of looping through every row?
The best way to validate the performance of anything is to test. Otherwise, a lot of assumptions are made about what is best versus the reality of performance.
With that said, 100% of the time using a WHERE clause will be better than retrieving the data and then filtering via a loop. This is for a few different reasons, but ultimately you are filtering the rows on a column value before retrieving them, versus retrieving all of the rows and then filtering them afterwards. Relational data should be dealt with according to set logic, which is how a WHERE clause works: it operates on the data set. The loop is not set logic; it compares each individual row, expensively, discarding those that don't meet the criteria.
Don’t take my word for it though. Try it out. Especially try it out when your app has a lot of data in the table.
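A minimal sketch of the two approaches side by side (the table and column names follow the question, and conn is assumed to be an already-open OleDbConnection):

using System.Data.OleDb;

static void FindRow(OleDbConnection conn, int wantedId)
{
    // Approach 1: pull every row and filter in the loop (slow on large tables).
    using (var all = new OleDbCommand("SELECT * FROM table_name", conn))
    using (OleDbDataReader reader = all.ExecuteReader())
    {
        while (reader.Read())
        {
            if ((int)reader["id"] == wantedId)
                break;                     // found it, but every earlier row was still sent
        }
    }

    // Approach 2: let the database engine filter, and use a parameter instead of
    // concatenating the value into the SQL string.
    using (var one = new OleDbCommand("SELECT * FROM table_name WHERE id = ?", conn))
    {
        one.Parameters.AddWithValue("@id", wantedId);
        using (OleDbDataReader reader = one.ExecuteReader())
        {
            if (reader.Read())
            {
                // only the matching row comes back
            }
        }
    }
}

With the WHERE version, only the matching row crosses the connection; the filtering happens inside the Access engine.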
Yes, of course.
Say you have an Access database file shared in a folder, and you deploy your .NET desktop application to each workstation.
And furthermore, say the table has 1 million rows.
If you do this:
SELECT * from tblInvoice WHERE InvoiceNumber = 123245
Then ONLY one row is pulled down the network pipe - and this holds true EVEN if the table has 1 million rows. Traversing and pulling 1 million rows is going to take a HUGE amount of time, but if you add criteria to your select, then in this case it would be about 1 million times faster to pull one row as opposed to the whole table.
And say this is/was multi-user? Then again, even on a network, ONLY ONE record that meets your criteria will be pulled. The only requirement for this "one row pull" over the network? The Access data engine needs a usable index on that criteria. By default the PK column (ID) always has that index - so no worries there. But if, as per above, we are pulling invoice numbers from a table, then an index on that column (InvoiceNumber) is required for the data engine to pull only one row. If no index can be used, then behind the scenes all rows are pulled until a match occurs - and over a network that means significant amounts of data will be pulled without that index (or, if local, pulled from the file on disk).
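If that index does not exist yet, it can be created once up front, for example from code; a minimal sketch using the table and column names from the invoice example (the index name is arbitrary):

using System.Data.OleDb;

// One-time setup: index the column used in the WHERE clause so the Access data
// engine can seek to the matching row instead of scanning the whole table.
static void EnsureInvoiceNumberIndex(string connectionString)
{
    using (var conn = new OleDbConnection(connectionString))
    using (var cmd = new OleDbCommand(
        "CREATE INDEX idxInvoiceNumber ON tblInvoice (InvoiceNumber)", conn))
    {
        conn.Open();
        cmd.ExecuteNonQuery();   // throws if the index already exists, so run it once
    }
}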

EF Code First - Change column max length without maxing out DB

I am dealing with a problem where most of our columns were created with the default EF behaviour, which maps string to nvarchar(max). However, that doesn't combine well with indexes.
I tried putting the [MaxLength(100)] attribute onto the specific column and generating a migration. That generates an ALTER TABLE statement that, when run on a database with a lot of data, spikes the DTU and basically trashes the DB.
I am now looking for a safe way to proceed (let's say the column name is "FileName"):
Create a column FileNameV2 with [MaxLength(100)].
Copy data from FileName column to FileNameV2.
Delete FileName column.
Rename FileNameV2 to FileName
Would this approach work or is there any better / easier way (especially one that doesn't upset EF)?
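A minimal sketch of those four steps written as a manual EF 6 migration; the table name dbo.Documents, the batch size, and truncating with LEFT are assumptions, so treat it as an outline to test on a copy of the database first:

using System.Data.Entity.Migrations;

// Hand-written migration implementing the FileName -> FileNameV2 -> FileName swap.
public partial class ResizeFileNameColumn : DbMigration
{
    public override void Up()
    {
        AddColumn("dbo.Documents", "FileNameV2", c => c.String(maxLength: 100));

        // Copy in batches so the DTU spike stays manageable (batch size is arbitrary).
        Sql(@"
            WHILE EXISTS (SELECT 1 FROM dbo.Documents
                          WHERE FileNameV2 IS NULL AND FileName IS NOT NULL)
            BEGIN
                UPDATE TOP (10000) dbo.Documents
                SET FileNameV2 = LEFT(FileName, 100)
                WHERE FileNameV2 IS NULL AND FileName IS NOT NULL;
            END");

        DropColumn("dbo.Documents", "FileName");
        RenameColumn("dbo.Documents", "FileNameV2", "FileName");
    }

    public override void Down()
    {
        // Reverting a data-destructive change is left out of this sketch.
    }
}

The [MaxLength(100)] attribute would still need to go onto the property afterwards so the EF model matches the new schema.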
The main issue, as I found out later, was that our SQL Azure database had a max size of 2 GB. The DB was at 1.5 GB when I made the change, so it probably hit its size limit during the transition from nvarchar(max) to nvarchar(100). The lesson is to double-check the max size of your DB on Azure to be sure you don't hit this threshold.

How to retrieve newest row in an Azure Table?

I am trying to retrieve the newest row created in the Primary Minute Metrics table that is automatically created by Azure. Is there any way to do this without scanning through the whole table? The partition key is basically the timestamp in a different format. For example:
20150811T1250
However, there is no way for me to tell what the latest PartitionKey is, so I can't just query by partition. Also, the RowKey is useless since all the rows have the same RowKey. I am completely stumped on how to do this, even though it seems like a really basic thing to do. Any ideas?
An example of a few partition keys of rows in the table:
20150813T0623
20150813T0629
20150813T0632
20150813T0637
20150813T0641
20150813T0646
20150813T0650
20150813T0654
EDIT: As a follow-up question, is there a way to scan the table backwards? That would let me just take the first row scanned, since that would be the latest row.
When it comes to querying data, Azure Tables offer very limited choices. Given that you know how the PartitionKey gets assigned (YYYYMMDDTHHmm format), one possible solution would be to query from the current date/time (in UTC) minus some offset up to the current date/time and go from there.
For example, assume the start time is 03-Dec-2015 00:00:00. What you could do is try to fetch data from 02-Dec-2015 23:00:00 to 03-Dec-2015 00:00:00 and see if any records are returned. If records are returned, you simply take the last entry in the result set, and that is your latest entry. If no records are found, you move back by one hour (i.e. from 02-Dec-2015 22:00:00 to 02-Dec-2015 23:00:00), fetch records again, and repeat until you find a matching result.
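A minimal sketch of that walk-back loop using the classic WindowsAzure.Storage table SDK (the 48-hour cutoff and the use of DynamicTableEntity are assumptions; PartitionKeys are treated as the yyyyMMddTHHmm strings shown above):

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.WindowsAzure.Storage.Table;

// Walks back one hour at a time until a window with data is found, then returns
// the last entity of that window (PartitionKeys sort ascending, so last = newest).
static DynamicTableEntity GetLatestMetricsRow(CloudTable table)
{
    DateTime windowEnd = DateTime.UtcNow;
    for (int i = 0; i < 48; i++)                       // give up after 48 hours back
    {
        DateTime windowStart = windowEnd.AddHours(-1);
        string filter = TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey",
                QueryComparisons.GreaterThanOrEqual, windowStart.ToString("yyyyMMdd'T'HHmm")),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("PartitionKey",
                QueryComparisons.LessThanOrEqual, windowEnd.ToString("yyyyMMdd'T'HHmm")));

        List<DynamicTableEntity> rows =
            table.ExecuteQuery(new TableQuery().Where(filter)).ToList();
        if (rows.Count > 0)
            return rows[rows.Count - 1];

        windowEnd = windowStart;                       // nothing here, move back an hour
    }
    return null;
}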
Yet another idea (though a bit of a convoluted one) is to create another table and periodically copy the data from the main table to this new table. When you copy the data, take the PartitionKey value, create a date/time object out of it, and subtract that from DateTime.MaxValue. Calculate the ticks for this new value and use that as the PartitionKey for your new entity (you would need to convert those ticks into a string and pad it so that all values are of the same length). Now the latest entries will always be on top.
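And for the copy-table idea, a sketch of building the inverted-ticks PartitionKey (padding to 19 digits keeps lexicographic order equal to numeric order, which is what makes the newest entries sort first):

using System;
using System.Globalization;

// "20150813T0623" -> parse -> invert against DateTime.MaxValue -> zero-padded ticks.
static string ToDescendingPartitionKey(string minuteMetricsPartitionKey)
{
    DateTime t = DateTime.ParseExact(minuteMetricsPartitionKey, "yyyyMMdd'T'HHmm",
                                     CultureInfo.InvariantCulture);
    long invertedTicks = DateTime.MaxValue.Ticks - t.Ticks;
    return invertedTicks.ToString("d19");   // pad so string order matches numeric order
}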

Optimizing SDF filesize

I recently started learning LINQ and SQL. As a small project I'm writing a dictionary application for Windows Phone. The project is split into two applications: one application (that currently runs on my PC) generates an SDF file on my PC, and the second app runs on my Windows Phone and searches the database. However, I would like to optimize the data usage. The raw entries of the dictionary are written in a TXT file with a file size of around 39 MB. The file has the following layout:
germanWord \tab englishWord \tab group
germanWord \tab englishWord \tab group
The file is parsed into a SDF database with the following tables.
Table Word with columns _version (rowversion), Id (int IDENTITY), Word (nvarchar(250)), Language (int)
This table contains every single word in the file. The language is a flag from my code that I used in case I want to add more languages later. A word-language pair is unique.
Table Group with columns _version (rowversion), GroupId (int IDENTITY), Caption (nvarchar(250))
This table contains the different groups. Every group is present one time.
Table Entry with columns _version (rowversion), EntryId (int IDENTITY), WordOneId (int), WordTwoId(int), GroupId(int)
This table links translations together. WordOneId and WordTwoId are foreign keys to a row in the Word Table, they contain the id of a row. GroupId defines the group the words belong to.
I chose this layout to reduce the data footprint. The raw text file contains some German (or English) words multiple times, and there are around 60 groups that repeat themselves. Programmatically I reduce the word count from around 1,800,000 to around 1,100,000, and there are around 50 rows in the Group table. Despite the reduced number of words, the SDF is around 80 MB in file size - more than twice the size of the raw data. Another thing is that, in order to speed up searching for translations, I plan to index the Word column of the Word table. Adding this index grows the file to over 130 MB.
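For reference, the Entry table described above might look roughly like this as a LINQ to SQL mapping (the attribute choices are assumptions based on the column descriptions):

using System.Data.Linq;
using System.Data.Linq.Mapping;

// Sketch of the Entry table from the description as a LINQ to SQL entity.
[Table(Name = "Entry")]
public class Entry
{
    [Column(IsVersion = true)]
    private Binary _version;                 // the rowversion column

    [Column(IsPrimaryKey = true, IsDbGenerated = true)]
    public int EntryId { get; set; }

    [Column]
    public int WordOneId { get; set; }       // FK to Word.Id

    [Column]
    public int WordTwoId { get; set; }       // FK to Word.Id

    [Column]
    public int GroupId { get; set; }         // FK to Group.GroupId
}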
How can it be that the SDF with ~60% of the original data is twice as large?
Is there a way to optimize the filesize?
The database file must contain all of the data from your raw file, in addition to row metadata. It will also store the strings according to the datatypes specified; I believe your option here is NVARCHAR, which uses two bytes per character. Combining these considerations, it would not surprise me that a database file is over twice as large as a text file of the same data using the ISO-Latin-1 character set.
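If the goal is simply a smaller file once the import has finished, one option on the PC-side generator is to compact the SDF after it is populated; a minimal sketch (how much space this recovers depends on how the data was written):

using System.Data.SqlServerCe;

// Rebuilds the SDF, reclaiming unused pages and empty space left by the import.
static void CompactDatabase(string sdfPath)
{
    string connStr = "Data Source=" + sdfPath + ";";
    using (var engine = new SqlCeEngine(connStr))
    {
        engine.Compact(null);   // null = overwrite the existing file with the compacted copy
    }
}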

How to implement an Oracle -> Oracle conversion/refresher program in C# / ADO.NET 2.0

When the program runs for the first time, it just gets some fields from a source database table, say:
SELECT NUMBER, COLOR, USETYPE, ROOFMATERIALCODE FROM HOUSE; -- NUMBER is the unique key
Then it does some in-memory processing, say converting USETYPE and ROOFMATERIALCODE to the destination database format (by using a cross-reference table).
Then the program inserts ALL THE ROWS into the destination database:
INSERT INTO BUILDING (BUILDINGID, BUILDINGNUMBER, COLOR, BUILDINGTYPE, ROOFMAT)
VALUES (PROGRAM_GENERATED_ID, NUMBER_FROM_HOUSE, COLOR_FROM_HOUSE,
CONVERTED_USETYPE_FROM_HOUSE, CONVERTED_ROOFMATERIALCODE_FROM_HOUSE);
The above is naturally not real SQL, but you get the idea (the values with underscores just describe the data being inserted).
On subsequent runs the program should do the same, except:
insert only the rows not found in the target database;
update only the ones whose color, usetype, or roofmaterialcode has changed.
My question is:
How to implement this in efficient way?
- Do I first populate a DataSet and convert the fields to the destination format?
- If I use only one DataSet, how do I give the destination DB its BUILDING_IDs (can I add columns to a populated DataSet?)?
- How do I efficiently check whether destination rows need a refresh (if I select them one at a time by BUILDING_NUMBER and check all fields, it's going to be slow)?
Thanks for your answers!
-matti
If you are using Oracle, have you looked at the MERGE statement? You give the MERGE statement criteria: if records match the criteria, it performs an UPDATE; if they don't match (they aren't already in the table), it performs an INSERT. That might be helpful for what you are trying to do.
Here is the spec/example of merge.
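A minimal sketch of driving such a MERGE from ADO.NET 2.0 with System.Data.OracleClient (ODP.NET works the same way); the STG_HOUSE staging table, its column names, and the BUILDING_SEQ sequence are assumptions - the idea is that the converted rows are bulk-loaded into the staging table first and then merged in one statement:

using System.Data.OracleClient;   // shipped with .NET Framework 2.0

// Inserts new buildings and updates changed ones in a single statement.
// STG_HOUSE is an assumed staging table already loaded with the converted values
// (HOUSE_NUMBER is used instead of NUMBER, which is a reserved word in Oracle).
static int MergeBuildings(string connectionString)
{
    const string mergeSql = @"
        MERGE INTO BUILDING b
        USING STG_HOUSE s
           ON (b.BUILDINGNUMBER = s.HOUSE_NUMBER)
        WHEN MATCHED THEN
            UPDATE SET b.COLOR        = s.COLOR,
                       b.BUILDINGTYPE = s.CONVERTED_USETYPE,
                       b.ROOFMAT      = s.CONVERTED_ROOFMATERIALCODE
             WHERE b.COLOR        <> s.COLOR
                OR b.BUILDINGTYPE <> s.CONVERTED_USETYPE
                OR b.ROOFMAT      <> s.CONVERTED_ROOFMATERIALCODE
        WHEN NOT MATCHED THEN
            INSERT (BUILDINGID, BUILDINGNUMBER, COLOR, BUILDINGTYPE, ROOFMAT)
            VALUES (BUILDING_SEQ.NEXTVAL, s.HOUSE_NUMBER, s.COLOR,
                    s.CONVERTED_USETYPE, s.CONVERTED_ROOFMATERIALCODE)";

    using (var conn = new OracleConnection(connectionString))
    using (var cmd = new OracleCommand(mergeSql, conn))
    {
        conn.Open();
        return cmd.ExecuteNonQuery();   // total number of rows inserted or updated
    }
}

The WHERE clause on the MATCHED branch limits updates to rows whose values actually changed (NULL-safe comparison is omitted here for brevity).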
