SSIS: Split rows in a flat file using conditional split or script - C#

I'm new to SSIS; any idea or solution is greatly appreciated.
I have a flat file where the first row contains the file details (not the header). The second row onwards is the actual data.
Data description
First-row format: Supplier_name, Date, Number of records in the file
e.g.:
Supplier_name^06022017^3
ID1^Member1^NEW YORK^050117^50.00^GENERAL^ANC
ID2^Member2^FLORIDA^050517^50.00^MOBILE^ANC
ID3^Member3^SEATTLE^050517^80.00^MOBILE^ANC
EOF
Problem
Using SSIS I want to split the first row into output1 and the second row onwards into output2.
I thought I could do this with the help of a conditional split, but I'm not sure what condition to use to split the rows. Should I try a multicast instead?
Thanks

I would handle this by using a Script Task (BEFORE the data flow) to read the first row and do whatever you want with it.
Then, in the Data Flow Task, I would set the flat file source to ignore the first row and import the second row onwards as data.
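A minimal sketch of such a Script Task in C#, assuming the file path and three hypothetical header-field variables have been added to the task (ReadOnlyVariables: User::FilePath; ReadWriteVariables: User::SupplierName, User::FileDate, User::RecordCount):
using System.IO;
using System.Linq;

public void Main()
{
    // Read just the first line; the data flow handles the rest of the file
    string header = File.ReadLines(Dts.Variables["User::FilePath"].Value.ToString()).First();

    // First-row format: Supplier_name^Date^RecordCount
    string[] parts = header.Split('^');
    Dts.Variables["User::SupplierName"].Value = parts[0];
    Dts.Variables["User::FileDate"].Value = parts[1];
    Dts.Variables["User::RecordCount"].Value = int.Parse(parts[2]);

    Dts.TaskResult = (int)ScriptResults.Success;
}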

Thank you all. Here is an alternative solution.
I used a Script Component in SSIS to do this.
Step 1: Add an output column called RowNumber to the Script Component.
Step 2: In the Script Component, populate that additional column with an incrementing row number.
SSIS Script component
private int m_rowNumber;

public override void PreExecute()
{
    base.PreExecute();
    m_rowNumber = 0;   // reset the counter before any rows flow through
}

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    m_rowNumber++;
    Row.RowNumber = m_rowNumber;   // RowNumber is the output column added to the component
}
Step 3: Use the output of the Script Component as the input of a Conditional Split, and create a condition RowNumber == 1.
The Conditional Split will route the first row to one output and the remaining rows to the default output.

I would first make sure that you have the correct number of columns in your Flat File Connection:
Edit the Flat File Connection -> Advanced tab, and press the New button to add columns. In your example you should have 7, Column 0 to Column 6.
Now add a Conditional Split and add two case statements (matching the case of the data, since SSIS string comparison is case-sensitive):
Output Name   Condition
HeaderRow     [Column 0] == "Supplier_name"
DetailRow     [Column 0] != "Supplier_name"
Now route these to Output 1 and Output 2.

Expanding on Tab Allerman's answer.
For our project, we used a PowerShell script inside an Execute Process Task, which runs a simple PowerShell command to grab the first line of the file.
See this MSDN blog on how to run a PowerShell script.
PowerShell script to get the first line:
Get-Content C:\foo\yourfolderpath\yourfilename.txt -First 1
This not only helps in cases like yours, but more generally helps you avoid processing large files (GBs and upwards) that have an incorrect header. This simple PowerShell command executes in milliseconds, as opposed to most processes/scripts, which need to load the full file into memory, slowing things down.
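To wire this into an Execute Process Task, the settings would look something like the following sketch (the output variable name is an assumption; StandardOutputVariable captures the command's output for later use):
Executable: powershell.exe
Arguments: -NoProfile -Command "Get-Content 'C:\foo\yourfolderpath\yourfilename.txt' -First 1"
StandardOutputVariable: User::FirstLine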


How to use a string content as a variable name in a SSIS C# script

Hello everyone, and thanks in advance for your time and answers.
Let me give you some context first:
I'm working at a bank on a metrics project. We are re-engineering all the ETL processes and MicroStrategy dashboards of the commerce sector; they use a lot of non-IT data sources, and we have to map that info to centralized IT sources on SQL Server 2008 R2 servers. For the ETL we are using SSIS.
I have an ETL for loans. Inside a data flow I gather all the information I need about loans; then, from one particular table, I get all the conditions that need to be tested to classify the loan. The conditions table has this form:
sk_condition_name: varchar
sk_whatever: ...
...
where_clause: varchar(900)
In the "where_clause" column I have a where clause (duh!) that tests some columns from the loan, like this:
loan_type = x AND client_type = y AND loan_rate = z
Before I get deeper into this, I need to say that the example I'm giving is about loans, but the same goes for all the products the bank sells, like insurance or investment funds... And the conditions used to classify a product can change over time. Also, one specific loan can be classified in multiple ways at the same time, and each positive classification writes a row to a specific table; that's why I need an asynchronous Script Component.
Where was I? Right, loans... So in the ETL I get all the loan data and those where_clauses; in a C# Script Component we take the clause apart with regular expressions, so we end up with two strings for every check the clause was doing. Using the example above, I would end up with 3 pairs of strings: ("loan_type", "x"), ("client_type", "y") and ("loan_rate", "z").
And this is where the problem comes in:
I can't find a way in the script to use the first string's content as the name of the row column. Something like this is what I mean:
if Row.(string1.content()) = string2 then ...
Now the limitations:
It's a bank; they don't like new things, so the tools I can use are those of SSIS.
Changes in the model might be out of discussion... are out of discussion.
I need this to be completely dynamic, with no hardcoded conditions, because of the changing nature of these conditions.
I've searched a lot for a solution to this but found none that works. This is my last resort.
I hope I have been decently clear in this, my first post ever.
Please please please help me!
Thank you very much!!
EDIT 1: To clarify...
My end result is generating a new row to be inserted into one particular table for each condition that tested positive. The information to be inserted and the target table are irrelevant to the problem at hand. The loan type, client and rate are just examples of what the conditions test. My problem is that I can't use a string's content as the name of the row's column.
You can do this with Reflection.
Add "using System.Reflection;" to the namespaces - then you can iterate with the following code:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    string sColumn = "Name";   // string1: the column to check
    string sFind = "John";     // string2: the value to check

    // Walk the buffer's properties looking for the column whose name matches
    foreach (PropertyInfo p in Row.GetType().GetProperties())
    {
        if (p.Name == sColumn)
        {
            string sval = p.GetValue(Row, null).ToString();
            if (sval != sFind)
            {
                //Do Stuff
            }
        }
    }
}
In this example I have hard-coded string1 (the column to check) to "Name" and string2 (the value to check) to "John".
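As a small refinement (a sketch, not part of the original answer), Reflection can also fetch the property directly by name instead of looping over all of them:
PropertyInfo p = Row.GetType().GetProperty(sColumn);
if (p != null)
{
    string sval = p.GetValue(Row, null).ToString();
    if (sval != sFind)
    {
        //Do Stuff
    }
}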

SSIS load 2 columns of a flat file (first non null only) into a variable

I have a flat file with the following columns:
SampleID Rep_Number Product Protein Fat Solids
In the flat file, SampleID and Product are populated in the first row only; the rest of the rows only have values for Rep_Number, Protein, Fat and Solids. SampleID and Product are blank for the rest of the rows. So my task is to fill those blank rows with the SampleID and Product from the first row and load the result into the table.
So the task is to pick the first non-null SampleID and Product from the flat file and put them in variables. The rest is all configured. If I can pick the first non-null SampleID and Product directly from the flat file and put them into their respective variables, I can take it from there. That is all I need.
I can connect a Script Component to the flat file source in a data flow task. I need help with the script to pick the first non-null values (SampleID and Product).
Need help please. Thanks in advance.
If you are sure you need to store the values of the first 2 columns of the 1st row in variables and take it from there, and DO NOT REQUIRE a change in your original approach, then try this:
You need a NEW variable to keep track of the ROW COUNT. Let this be an integer and set it to 0. This will help process only the first row and skip the rest. Let's call this Row_Count.
After you retrieve data from the component connected to the flat file source, connect it to a Script Component and click 'Edit'.
In the 'Script Transformation Editor', click 'Input Columns' on the left and select the desired columns (say Column_Name1 and Column_Name2) you want to retrieve the values from (i.e. the 1st and 2nd).
Click 'Script' on the left.
Under 'Custom Properties', expand 'ReadWriteVariables'. Add the 2 variables you intend to use for storing the values AND the Row_Count variable.
Click 'Edit Script'.
In the editor that opens, double-click 'ScriptMain.vb' on the right.
The Row buffer is only available inside Input0_ProcessInputRow, while the ReadWriteVariables can only be assigned in PostExecute, so capture the values in private fields and hand them to the variables at the end:
Private m_rowCount As Integer = 0
Private m_val1 As String, m_val2 As String
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    ' Capture the first row's values only
    If m_rowCount = 0 Then m_val1 = Row.Column_Name1 : m_val2 = Row.Column_Name2
    m_rowCount += 1
End Sub
Public Overrides Sub PostExecute()   ' ReadWriteVariables are only writable here
    Variables.Your_Variable1 = m_val1 : Variables.Your_Variable2 = m_val2
    Variables.Row_Count = m_rowCount
End Sub
You now have the desired values in your variables; proceed with the rest of your logic.
Note:
If you do not add the variables to the 'ReadWriteVariables', you will not be able to assign them in the script.
Depending on any other code you add in the script, you may need to include additional Imports statements if they are not present.
Please mark my post as answer if it helps :)

c# ssis - script component within data flow

I'm trying to parse values from a text file using C#, then append them as a column to the existing data set coming from the same text file. Example file data:
As of 1/31/2015
1 data data data
2 data data data
So I want to use a Script Component within an SSIS data flow to append the 1/31/2015 value as a 4th column. This package iterates through several files, so I would like this to take place within each data flow. I have no issues with getting the rest of the data into individual columns in the database, but I parse most of those out using T-SQL after fast-loading everything as one big column into the database.
edit:
Here is the .NET code I used to get it started. I know this is probably far from optimal, but I actually want to parse two values from the resulting string, and that would be easy to do with regular expressions:
// Grab the first 7 lines and join them into one string for later parsing
string[] lines = File.ReadLines(filename).Take(7).ToArray();
string every = string.Join(",", lines);   // String.Join returns the result; it must be assigned
MessageBox.Show(every);
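For the data flow itself, a minimal sketch of a synchronous Script Component in C# (a sketch under assumptions: the current file path is in a read-only variable User::FileName, and an AsOfDate output column has been added to Output 0): it reads the first line in PreExecute, pulls the date out with a regular expression, and stamps it on every row.
using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

private DateTime m_asOfDate;

public override void PreExecute()
{
    base.PreExecute();
    // The first line of each file looks like: "As of 1/31/2015"
    string firstLine = File.ReadLines(Variables.FileName).First();
    m_asOfDate = DateTime.Parse(Regex.Match(firstLine, @"\d{1,2}/\d{1,2}/\d{4}").Value);
}

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    Row.AsOfDate = m_asOfDate;   // appended as the 4th column on every data row
}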

SSIS and specflow

I wanted to know:
I have an SSIS package which can
1) read multiple input files
2) store the data from the files to the DB
3) archive the input files
I have to write a functional test using SpecFlow.
One of my test cases is:
Check that the row count in the DB table equals the summation of all lines in each input file read.
I am not sure how I can achieve this. Can anyone help me with:
how to get the summation of lines in each file.
To check the row count in the table in the DB:
Add a variable to your SSIS package, named something like iRowCount.
In the Data Flow Task where the DB is the source, add a Row Count component, and assign the value to the variable iRowCount.
To be equal to the summation of all lines in each input file read:
Same concept, another two variables, named something like iFilesRowCount and iFilesRowCountTotal.
Then, in each data pump, you'll have to pull off a Row Count component, assigning the value to iFilesRowCount. Then, outside of each data pump, add a Script Task that does iFilesRowCountTotal = iFilesRowCountTotal + iFilesRowCount, as sketched below.
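A minimal sketch of that accumulating Script Task in C# (assuming User::iFilesRowCount is listed under ReadOnlyVariables and User::iFilesRowCountTotal under ReadWriteVariables):
public void Main()
{
    // Add this file's row count to the running total
    int total = (int)Dts.Variables["User::iFilesRowCountTotal"].Value;
    int current = (int)Dts.Variables["User::iFilesRowCount"].Value;
    Dts.Variables["User::iFilesRowCountTotal"].Value = total + current;
    Dts.TaskResult = (int)ScriptResults.Success;
}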
Then, somewhere towards the bottom, add a Script Task to perform the comparison between iRowCount (DB) and iFilesRowCountTotal, and on the arrows exiting that task create precedence constraints to pull off a 'True' path and a 'False' path.

SSIS - How do I load data from text files where the path of files is inside another text file?

I have a text file that contains a list of files to load into the database.
The list contains two columns:
FilePath,Type
c:\f1.txt,A
c:\f2.txt,B
c:\f3.txt,B
I want to provide this file as the source to SSIS. I then want it to go through the list line by line. For each line, I want it to read the file at the path in the FilePath column and check the Type.
If the type is A, I want it to ignore the first 4 lines of the file located at the FilePath of the current line, and then load the rest of the data inside that file into a table.
If the type is B, I want it to open the file and copy the first column of the file into table 1 and the second column into table 2, for all of the lines.
I would really appreciate it if someone could provide me a high-level list of the steps I need to follow.
Any help is appreciated.
Here is one way of doing it within SSIS. The below steps are with respect to SSIS 2008 R2.
Create an SSIS package and create three package variables, namely FileName, FilesToRead and Type. The FilesToRead variable will hold the list of files and their type information. We will have a loop that goes through each of those records and stores the information in the FileName and Type variables every time it loops through.
On the control flow tab, place a Data Flow Task followed by a ForEach Loop container. The data flow task would read the file containing the list of files that have to be processed. The loop would then go through each file. Your control flow tab would finally look something like this. For now, there will be errors because nothing is configured. We will get to that shortly.
On the connection manager section, you need four connections.
First, you need an OLE DB connection to connect to the database. Name this as SQLServer.
Second, a flat file connection manager to read the file that contains the list of files and types. This flat file connection manager will contain two columns, namely FileName and Type. Name this as Files.
Third, another flat file connection manager to read all files of type A. Name this as Type_A. In this flat file connection manager, enter the value 4 in the text box Header rows to skip so that the first four rows are always skipped.
Fourth, one more flat file connection manager to read all files of type B. Name this as Type_B.
Let's get back to the control flow. Double-click on the first data flow task. Inside the data flow task, place a flat file source that would read all the files using the connection manager Files, and then place a Recordset Destination. Configure the variable FilesToRead in the recordset destination. Your first data flow task would look as shown below.
Now, let's go back to the control flow tab again. Configure the ForEach loop as shown below. This loop will go through the recordset stored in the variable FilesToRead. Since the recordset contains two columns, each time a record is looped through, the variables FileName and Type will be assigned the values of the current record.
Inside the ForEach Loop container, there are two data flow tasks, namely Type A files and Type B files. You can configure each of these data flow tasks according to your requirements to read the files from the connection managers. However, we need to disable each task based on the type of file that is being read.
Type A files data flow task should be enabled only when A type files are being processed.
Similarly, Type B files data flow task should be enabled only when B type files are being processed.
To achieve this, click on the Type A files data flow task and press F4 to bring up the properties. Click on the Ellipsis button available on the Expressions property.
On the Property Expressions Editor, select the Disable property and enter the expression !(@[User::Type] == "A")
Similarly, click on the Type B files data flow task and press F4 to bring up the properties. Click on the Ellipsis button available on the Expressions property.
On the Property Expressions Editor, select the Disable property and enter the expression !(@[User::Type] == "B")
Here is a sample Files.txt containing only A type files in the list. When the package is executed to read this file, you will notice that only the Type A files data flow task runs.
Here is another sample Files.txt containing only B type files in the list. When the package is executed to read this file, you will notice that only the Type B files data flow task runs.
If Files.txt contains both A and B type files, the loop will execute the appropriate data flow task based on the type of file that is being processed.
Configuring Data Flow task Type A files
Let's assume that your flat files of type A have a three-column layout, as shown below, with comma-separated values. The file data here is shown using Notepad++ with all special characters visible. CR LF denotes that the lines end with a carriage return and line feed. This file is stored in the path C:\f1.txt
We need a table in the database to import the data. Let's create a table named dbo.Table_A in the SQL Server database as shown here.
Now, go to the SSIS package. Here are the details to configure the flat file connection manager named Type_A. Give a name to the connection manager. You need to specify the value 4 in the Header rows to skip textbox. Your flat file connection manager should look something like this.
On the Advanced tab, you can rename the column names if you would like to.
Now that the connection manager is configured, we need to configure data flow task Type A files to process the corresponding files. Double-click on the data flow task Type A files. Place a Flat file source and OLE DB Destination inside the task.
The flat file source has to be configured to read the files from the flat file connection manager.
The data flow task doesn't do anything special. It simply reads the flat files of type A and inserts the data into the table dbo.Table_A. Now, we need to configure the OLE DB Destination to insert the data into the database. The column names configured in the flat file connection manager and in the table are not the same, so they have to be mapped manually.
Now that the data flow task is configured, we have to make sure that the file path being read from Files.txt is passed correctly. To do this, click on the Type_A flat file connection manager and press F4 to bring up the properties. Set the DelayValidation property to True. Click on the Ellipsis button on the Expressions property.
On the Property Expression builder, select the ConnectionString property and set it to the expression @[User::FileName]
Here is a sample Files.txt file containing Type A files only.
Here are the sample type A files f01.txt and f02.txt
After the package execution, following data will be found in the table Table_A
The above configuration steps have to be followed for Type B files as well. However, the data flow task would look slightly different, since the file-processing logic is different. Data flow task Type B files would look something like this. Since you have to insert the two columns in type B files into different tables, you have to use a Multicast transformation, which creates clones of the input data. You can use each of the Multicast outputs to pass the data through to a different transformation or destination.
Hope that helps you to achieve your task.
I would recommend that you create an SSIS package for each different type of file load you're going to do. You can execute those packages from another program; see here: How to execute an SSIS package from .NET?
Given this information, you can write a quick program to execute the relevant packages:
using System.IO;
using System.Linq;

// Parse the control file: skip the header row, then split each line into path and type
var jobs = File.ReadLines("C:\\temp\\order.txt")
    .Skip(1)
    .Select(line => line.Split(','))
    .Select(tokens => new { File = tokens[0], Category = tokens[1] });

foreach (var job in jobs)
{
    // execute the relevant package for job.Category using job.File
}
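For the body of that loop, a minimal sketch using the Microsoft.SqlServer.Dts.Runtime API (the package naming scheme and the User::FileName variable are assumptions for illustration):
// Requires a reference to Microsoft.SqlServer.ManagedDTS
using Microsoft.SqlServer.Dts.Runtime;

var app = new Application();
// e.g. map Category "A" to LoadTypeA.dtsx, "B" to LoadTypeB.dtsx (hypothetical names)
Package pkg = app.LoadPackage(@"C:\packages\LoadType" + job.Category + ".dtsx", null);
pkg.Variables["User::FileName"].Value = job.File;   // tell the package which file to load
DTSExecResult result = pkg.Execute();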
My solution would use N + 1 flat file Connection Managers to handle the source files. CM A would address the skip-the-first-4-rows file format, CM B sounds like it's just a 2-column file, etc. The last CM would be used to parse the command file you've illustrated.
Now that you have all of those Connection Managers defined, you can go about the processing logic.
Create 3 variables: 2 of type String (CurrentPath, CurrentType) and 1 of type Object, which I called Recordset.
The first Data Flow reads all the rows from the flat file source using "CM Control." This is the data you supplied in your example.
We will then use that Recordset object as the source for a ForEach Loop Container, in what is commonly referred to as shredding. Bingle the term "shred recordset ssis" and you're bound to hit a number of articles describing how to do it. The net result is that for each row in that source CM Control file, you will assign those values into the CurrentPath and CurrentType variables.
Inside that loop container, create a central point for control to radiate out from. I find a Script Task works wonderfully for this: drag it onto the canvas, give it a strong name to indicate it's not used for anything, and then create a data flow to handle each processing permutation.
The magic comes from using Expressions. Dang near everything in SSIS can have expressions set on its properties, which is what separates the professionals from the poseurs. Here, we will double-click on the line connecting to a given data flow and change the constraint type from "Constraint" to "Expression and Constraint". The expression you would then use is something like @[User::CurrentType] == "A". This will ensure that path is only taken when both the parent task succeeded and the condition is true.
The second bit of expression magic will be applied to the connection managers themselves. They will need to have their ConnectionString property driven by the value of the @[User::CurrentPath] variable. This will allow a design-time value of C:\filea.txt but would allow a runtime value, from the control file, of \\network\share\ClientFileA.txt. Unless all the files have the same structure, you'll most likely need to set DelayValidation to True in the properties. Otherwise, SSIS will fail PreValidation, as all of the "CM A" to "CM N" connection managers would be using that CurrentPath variable, which may or may not be a valid connection string for that file layout.
