Visual Studio SSIS Script Transformation: Get Average and Count - c#

I'm new to VS and SSIS. I'm trying to consolidate data that I have in visual studio using the Script Component Transformation Editor that currently looks like this:
date type seconds
2016/07/07 personal 400
2016/07/07 business 300
2016/07/07 business 600
and transform it to look like this:
date type avgSeconds totalRows
2016/07/07 personal 400 1
2016/07/07 business 450 2
Basically, counting and taking the average for type and date.
I tried both the VB.NET and C# options and am open to either (new to both). Does anyone have any ideas on how I can do this?
I'm using VS 2012. I'm thinking I need to create a temp or buffer table to keep count of everything as I go through Input0_ProcessInputRow and then write it to the output at the end; I just can't figure out how to do this. Any help is much appreciated!
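For reference, the accumulate-then-emit idea described above maps to an asynchronous Script Component. The following is only a minimal sketch, assuming Output0 has been made asynchronous (SynchronousInputID = None), the input/output column names match the samples above, and date/type come through as strings with seconds as an integer:
// Needs "using System.Collections.Generic;" at the top of ScriptMain.
// Running totals per (date, type) key: [0] = sum of seconds, [1] = row count.
private readonly Dictionary<string, long[]> totals = new Dictionary<string, long[]>();

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    string key = Row.date + "|" + Row.type;
    long[] agg;
    if (!totals.TryGetValue(key, out agg))
    {
        agg = new long[2];
        totals.Add(key, agg);
    }
    agg[0] += Row.seconds;
    agg[1] += 1;
}

public override void Input0_ProcessInput(Input0Buffer Buffer)
{
    while (Buffer.NextRow())
    {
        Input0_ProcessInputRow(Buffer);
    }

    // After the last input buffer arrives, write one output row per (date, type) group.
    if (Buffer.EndOfRowset())
    {
        foreach (KeyValuePair<string, long[]> pair in totals)
        {
            string[] parts = pair.Key.Split('|');
            Output0Buffer.AddRow();
            Output0Buffer.date = parts[0];
            Output0Buffer.type = parts[1];
            Output0Buffer.avgSeconds = (decimal)pair.Value[0] / pair.Value[1];
            Output0Buffer.totalRows = (int)pair.Value[1];
        }
        Output0Buffer.SetEndOfRowset();
    }
}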

Based on your comments this might work.
Set up a data flow task that will do a merge join to combine the data
https://www.simple-talk.com/sql/ssis/ssis-basics-using-the-merge-join-transformation/
Send the data to a staging table. This can be created automatically by the OLE DB destination if you click 'New' next to the table drop-down. If you plan to rerun this package, you will need to add an Execute SQL task before the data flow task to delete from or truncate this staging table.
Create an execute sql task with your aggregation query. Depending on your situation an insert could work. Something like:
Insert ProductionTable
Select date, type, AVG(Seconds) avgSeconds, Count(*) totalRows
From StagingTable
Group By date, type
You can also use the above query, minus the insert production table, as a source in a data flow task in case you need to apply more transforms after the aggregation.

Related

SSIS Get List of Missing Files

I get a monthly submission of files from various companies that get loaded into SQL Server via an SSIS job. When the SSIS job runs I want to get a list of companies that did not submit a file. The files will have the date appended to the end so I'm assuming it will need to do some sort of wild card search through a list of names. If I'm expecting:
AlphaCO_File_yyyymmdd
BetaCO_File_yyyymmdd
DeltaCO_File_yyyymmdd
ZetaCO_File_yyyymmdd
and the file from ZetaCO is missing, I want to write ZetaCO to a table, or save it in a variable I can use in an email task.
I am using Visual Studio 2019 and SQL Server 2019. I have the Task Factory add-on for SSIS.
Note: this answer uses pseudo code that needs to be tuned for your specific values.
My guess is that you already have a foreach loop set up where you are reading the file name to a parameter.
What you need is a table in SQL Server to compare against.
CompanyName, FileSubmitted (bit)
AlphaCO
BetaCo
DeltaCo
ZetaCO
The first step is a SQL command: update table set FileSubmitted = 0.
Then, inside the foreach loop, add a path that updates that table based on the file name:
Use a token/expression or a C# Script Task; in C# it would be company = fileName.Split('_')[0];
And then update the table:
update table
set FileSubmitted = 1
where CompanyName = company
Now you can use that table for emails.
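If you go the C# Script Task route inside the loop, a minimal sketch might look like the following; the package variable names, table name and connection handling are assumptions, not part of the original answer:
// Inside the SSIS Script Task's Main(); add "using System.Data.SqlClient;" at the top.
public void Main()
{
    // "User::FileName" and "User::ConnString" are assumed package variables.
    string fileName = Dts.Variables["User::FileName"].Value.ToString();
    string company = fileName.Split('_')[0];   // e.g. "ZetaCO_File_20230101" -> "ZetaCO"

    using (var conn = new SqlConnection(Dts.Variables["User::ConnString"].Value.ToString()))
    using (var cmd = new SqlCommand(
        "UPDATE dbo.CompanyFiles SET FileSubmitted = 1 WHERE CompanyName = @company", conn))  // table name assumed
    {
        cmd.Parameters.AddWithValue("@company", company);
        conn.Open();
        cmd.ExecuteNonQuery();
    }

    Dts.TaskResult = (int)ScriptResults.Success;
}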

I need to create a new ETL C# process with large .CSV files to SQL Tables Add/Update records

I need to bring a number of .CSV files into unique keyed SQL tables (table names and column names match from source to target). I started looking at libs like Cinchoo-ETL, but I need to do an "upsert", meaning update if the record is present and insert if it's not. I'm not sure if Cinchoo-ETL or some other lib has this feature built in.
For example, let's say the SQL Server Customer table has some records in it, with Cust# as the primary key:
Cust# Name
1 Bob
2 Jack
The CSV file looks something like this:
Cust#,Name
2,Jill
3,Roger
When the ETL program runs it needs to update Cust# 2 from Jack to Jill and insert a new cust# 3 record for Roger.
Speed and reusability are important, as there will be 80 or so different tables and some of them can have several million records.
Any ideas for a fast easy way to do this? Keep in mind I'm not a daily developer so examples would be awesome.
Thanks!
You are describing something that can be done with a tool I developed. It's called Fuzible (www.fuzible-app.com): in Synchronization mode, it allows you to choose the behavior of the target table (allow INSERT, UPDATE, DELETE); your source can be any CSV file and your target any database.
You can contact me from the website if you need a how-to.
The software is free :)
What you have to do is create a Job with your CSV path as the Source connection and your database as the Target connection.
Choose the "Synchronization" mode, which, unlike the "Replication" mode, compares Source and Target data.
Then you can write as many queries as you want (one for each CSV file), like this:
MyOutputTable:SELECT * FROM MyCSVFile.CSV
No need to write more complex queries if the CSV and the database table share the same schema (same columns).
The software should be able to do the rest :) It updates rows that need to be updated and creates new rows if required.
This is what I did in a recent SSIS job. I load the data to a temp table and then use a regular SQL query to perform the comparison. This may be cumbersome on tables with lots of fields.
-- SEE DIFFERENCES FOR YOUR AMUSEMENT
SELECT *
FROM Accounts a
INNER JOIN
DI_Accounts da
ON a.CustomerNumber = da.CustomerNumber AND (
a.FirstName <> da.FirstName
)
-- UPDATE BASED ON DIFFERENCES
UPDATE a
SET
    a.FirstName = da.FirstName
FROM Accounts a
INNER JOIN
    DI_Accounts da
    ON a.CustomerNumber = da.CustomerNumber AND (
        a.FirstName <> da.FirstName
    )
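If you end up doing this from plain C# rather than SSIS, one variation of the same staging idea is to bulk-load the CSV with SqlBulkCopy and let a single MERGE do the upsert. A rough sketch under assumed names (dbo.Customer with Cust#/Name, an identical dbo.Customer_Staging table, and a simple CSV with a header row and no quoted commas):
using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;
using System.Linq;

string connectionString = "..connection_string_here..";

// Read the CSV into a DataTable (skip the header row; no quoted commas assumed).
var table = new DataTable();
table.Columns.Add("Cust#", typeof(int));
table.Columns.Add("Name", typeof(string));
foreach (var fields in File.ReadLines("customers.csv").Skip(1).Select(l => l.Split(',')))
{
    table.Rows.Add(int.Parse(fields[0]), fields[1]);
}

using (var conn = new SqlConnection(connectionString))
{
    conn.Open();

    // 1. Empty the staging table and bulk-load the CSV rows into it.
    new SqlCommand("TRUNCATE TABLE dbo.Customer_Staging", conn).ExecuteNonQuery();
    using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.Customer_Staging" })
    {
        bulk.WriteToServer(table);
    }

    // 2. One MERGE updates existing customers (2: Jack -> Jill) and inserts new ones (3: Roger).
    new SqlCommand(@"
        MERGE dbo.Customer AS target
        USING dbo.Customer_Staging AS src ON target.[Cust#] = src.[Cust#]
        WHEN MATCHED THEN UPDATE SET target.Name = src.Name
        WHEN NOT MATCHED THEN INSERT ([Cust#], Name) VALUES (src.[Cust#], src.Name);",
        conn).ExecuteNonQuery();
}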
I would recommend that you take a look at the nuget package ETLBox and the necessary extension packages for Csv & Sql Server (ETLBox.Csv + ETLBox.SqlServer).
This would allow you to write a code like this:
//Create the components
CsvSource source = new CsvSource("file1.csv");
SqlConnectionManager conn = new SqlConnectionManager("..connection_string_here..");
DbMerge dest = new DbMerge(conn, "DestinationTableName");
dest.MergeProperties.IdColumns.Add(new IdColumn() { IdPropertyName = "Cust#" });
dest.MergeMode = MergeMode.Full; //To create the deletes
dest.CacheMode = CacheMode.Partial; //Enable for bigger data sets
//Linking
source.LinkTo(dest);
//Execute the data flow
Network.Execute(source);
This code snippet would do the corresponding inserts/updates/deletes in the database table for one file. Make sure that the header names match the column names in your database table exactly (case-sensitive). For bigger data sets you need to enable the partial cache to avoid having all data loaded into memory.
It will use dynamic objects under the hood (ExpandoObject). You can find more information about the merge and the tool on the website (www.etlbox.net)
The only downside is that ETLBox is not open source. But the package allows you to work with data sets of up to 10,000 rows, so you can check if it suits your needs.

SQL Server - Best practice to circumvent large IN (...) clause (>40000 items)

I'm developing an ASP.NET app that analyzes Excel files uploaded by user. The files contain various data about customers (one row = one customer), the key field is CustomerCode. Basically the data comes in form of DataTable object.
At some point I need to get information about the specified customers from SQL and compare it to what user uploaded. I'm doing it the following way:
Make a comma-separated list of customers from CustomerCode column: 'Customer1','Customer2',...'CustomerN'.
Pass this string to SQL query IN (...) clause and execute it.
This was working okay until I ran into The query processor ran out of internal resources and could not produce a query plan exception when trying to pass ~40000 items inside IN (...) clause.
The trivial way seems to be to:
Replace IN (...) with = 'SomeCustomerCode' in query template.
Execute this query 40000 times for each CustomerCode.
Do DataTable.Merge 40000 times.
Is there any better way to work this problem around?
Note: I can't do IN (SELECT CustomerCode FROM ... WHERE SomeConditions) because the data comes from Excel files and thus cannot be queried from DB.
"Table valued parameters" would be worth investigating, which let you pass in (usually via a DataTable on the C# side) multiple rows - the downside is that you need to formally declare and name the data shape on the SQL server first.
Alternatively, though: you could use SqlBulkCopy to throw the rows into a staging table, and then just JOIN to that table. If you have parallel callers, you will need some kind of session identifier on the row to distinguish between concurrent uses (and: don't forget to remove your session's data afterwards).
You shouldn't process too many records at once: besides the error you mentioned, such a big batch takes a long time to run and you can't do anything in parallel. You shouldn't process only one record at a time either, because then the overhead of the SQL Server communication becomes too big. Choose something in the middle and process e.g. 10,000 records at a time. You can even parallelize the processing: start running the SQL for the next 10,000 in the background while you are processing the previous batch.
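A sketch of that batching idea; QueryCustomers is a hypothetical helper that runs the lookup (e.g. via the TVP approach above) for one batch, and allCodes is the list of customer codes from the upload:
using System.Collections.Generic;
using System.Data;
using System.Linq;

const int batchSize = 10000;
var merged = new DataTable();

for (int i = 0; i < allCodes.Count; i += batchSize)
{
    List<string> batch = allCodes.Skip(i).Take(batchSize).ToList();
    merged.Merge(QueryCustomers(batch));   // hypothetical helper, one round trip per batch
}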

SQL - Better two queries instead of one big one

I am working on a C# application, which loads data from a MS SQL 2008 or 2008 R2 database. The table looks something like this:
ID | binary_data | Timestamp
I need to get only the last entry and only the binary data. Entries to this table are added irregularly by another program, so I have no way of knowing if there is a new entry.
Which version is better (performance etc.) and why?
//Always a query, which might not be needed
public void ProcessData()
{
    byte[] data = "query code get latest binary data from db"
}
vs
//Always a smaller check-query, and sometimes two queries
public void ProcessData()
{
    DateTime timestamp = "query code get latest timestamp from db"
    if (timestamp > old_timestamp)
        data = "query code get latest binary data from db"
}
The binary_data field size will be around 30 kB. The function ProcessData will be called several times per minute, but sometimes it can be called every 1-2 seconds. This is only a small part of a bigger program with lots of threading/database access, so I want the "lightest" solution. Thanks.
Luckily, you can have both:
SELECT TOP 1 binary_data
FROM myTable
WHERE Timestamp > @last_timestamp
ORDER BY Timestamp DESC
If there is no record newer than @last_timestamp, no record is returned and, thus, no data transmission takes place (= fast). If there are new records, the binary data of the newest one is returned immediately (= no need for a second query).
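On the C# side, the combined query keeps ProcessData down to a single round trip. A hedged sketch, also selecting Timestamp so the caller can remember the newest value for the next call (table and column names taken from the question):
using System;
using System.Data.SqlClient;

public byte[] GetLatestData(SqlConnection conn, ref DateTime lastTimestamp)
{
    using (var cmd = new SqlCommand(@"
        SELECT TOP 1 binary_data, Timestamp
        FROM myTable
        WHERE Timestamp > @last_timestamp
        ORDER BY Timestamp DESC", conn))
    {
        cmd.Parameters.AddWithValue("@last_timestamp", lastTimestamp);
        using (SqlDataReader reader = cmd.ExecuteReader())
        {
            if (!reader.Read())
                return null;                          // nothing newer: no ~30 kB payload transferred

            lastTimestamp = reader.GetDateTime(1);    // remember the newest timestamp for the next call
            return (byte[])reader[0];
        }
    }
}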
I would suggest you perform tests using both methods as the answer would depend on your usages. Simulate some expected behaviour.
I would say though, that you are probably okay to just do the first query. Do what works. Don't prematurely optimise, if the single query is too slow, try your second two-query approach.
A two-step approach is more efficient in terms of the overall system workload:
Get informed that you need to query new data
Query new data
There are several ways to implement this approach. Here are two of them:
Using Query Notifications, which is built-in SQL Server functionality supported in .NET (see the sketch below).
Using an indirect method of detecting a database table update, e.g. the one described in this article on the SQL Authority blog.
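For the first option, .NET exposes Query Notifications through SqlDependency. A minimal sketch, with connection string, table and column names as assumptions; the query must follow the notification rules (explicit column list, two-part table name, Service Broker enabled):
using System.Data.SqlClient;

// Call SqlDependency.Start(connectionString) once at application start-up
// and SqlDependency.Stop(connectionString) on shutdown.
void SubscribeForChanges()
{
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("SELECT ID, binary_data, Timestamp FROM dbo.myTable", conn))
    {
        var dependency = new SqlDependency(cmd);
        dependency.OnChange += (sender, e) =>
        {
            SubscribeForChanges();   // a notification fires only once, so re-subscribe first
            ProcessData();           // then pull the new data with the existing query
        };

        conn.Open();
        using (var reader = cmd.ExecuteReader()) { }   // executing the command registers the subscription
    }
}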
I think the better path is a stored procedure that keeps the logic inside the database: something with an output parameter carrying the required data and a return value (e.g. TRUE/FALSE) to signal the presence of new data.

Best way to split up a long file: programming or SQL?

I have a database table (in MS Access) of GPS information with a record of speed, location (lat/long) and bearing of a vehicle for every second. There is a field that shows time like this: 2007-09-25 07:59:53. The problem is that this table has merged information from several files that were collected on this project. So, for example, 2007-09-25 07:59:53 to 2007-09-25 08:15:42 could be one file, and after a gap of more than 10 seconds the next file will start, like 2007-09-25 08:15:53 to 2007-09-25 08:22:12. I need to populate a file number field in this table, and the separating criterion for each file will be that the gap in time between the last and the next file is more than 10 seconds. I did this using C# code by iterating over the table, comparing each record to the next and changing the file number whenever the gap is more than 10 seconds.
My question is, should this type of problem be solved using programming or is it better solved using a SQL query? I can load the data into a database like SQL Server, so there is no limitation to what tool I can use. I just want to know the best approach.
If it is better to solve this using SQL, will I need to use cursors?
When solving this using programming (for example C#) what is an efficient way to update a Table when 20000+ records need to be updated based on an updated DataSet? I used the DataAdapter.Update() method and it seemed to take a long time to update the table (30 mins or so).
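For reference, the iterate-and-compare pass described in the question boils down to something like this; gpsTable is the DataTable sorted by the time field, and the column names are assumptions:
// Assign a file number, starting a new file whenever the gap to the previous row exceeds 10 seconds.
// Needs "using System;" and "using System.Data;"; "GpsTime"/"FileNumber" are assumed column names.
int fileNumber = 1;
DateTime? previous = null;

foreach (DataRow row in gpsTable.Rows)
{
    DateTime current = (DateTime)row["GpsTime"];
    if (previous.HasValue && (current - previous.Value).TotalSeconds > 10)
    {
        fileNumber++;           // gap of more than 10 seconds: the next file starts here
    }
    row["FileNumber"] = fileNumber;
    previous = current;
}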
Assuming SQL Server 2008 and CTEs from your comments:
The best time to use SQL is generally when you are comparing or evaluating large sets of data.
Iterative programming languages like C# are better suited to more expansive analysis of individual records, or analysis of rows one at a time (Row By Agonizing Row, or RBAR).
For examples of recursive CTEs, see here. MS has a good reference.
Also, depending on data structure, you could do this with a normal JOIN:
SELECT <stuff>
FROM MyTable T
INNER JOIN MyTable T2
    ON T2.pk = (SELECT MIN(pk) FROM MyTable WHERE pk > T.pk)
WHERE DATEDIFF(second, T.timefield, T2.timefield) > 10
This joins each row to its immediate successor; the rows where the gap exceeds 10 seconds mark where a new file starts.
