C# Transform Data in ETL Process

I am learning the ETL process in C# and have already extracted and read the sample CSV data, but I am unsure what to do to transform it properly.
I have been using this website as a reference for how to transform data, but I am unsure how to apply it to my sample data (below).
name,gender,age,numKids,hasPet,petType
Carl,M,43,2,true,gecko
Jake,M,22,1,true,snake
Cindy,F,53,3,false,null
Matt,M,23,0,true,dog
Ally,F,28,1,false,null
Megan,F,42,2,false,null
Carly,F,34,4,true,cat
Neal,M,27,2,false,null
Tina,F,21,2,true,pig
Paul,M,1,3,true,chicken
Below is how I extracted the data from the CSV file using CsvHelper:
using (FileStream fs = File.Open(@"C:\Users\Grant\Documents\SampleData4.csv", FileMode.Open, FileAccess.Read))
using (StreamReader sr = new StreamReader(fs))
{
    CsvConfiguration csvConfig = new CsvConfiguration()
        { BufferSize = bufferSize, AllowComments = true };

    using (var csv = new CsvReader(sr, csvConfig))
    {
        while (csv.Read())
        {
            var name = csv.GetField<string>(0);
            var gender = csv.GetField<string>(1);
            var age = csv.GetField<int>(2);
            var numKids = csv.GetField<int>(3);
            var hasPet = csv.GetField<bool>(4);
            var petType = csv.GetField<string>(5);
        }
    }
}
If you need me to provide additional details, just ask below.
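For illustration only, here is a minimal, library-agnostic sketch of one possible transform step over this sample: map each row into a strongly-typed object and normalize the values you care about. Person is a hypothetical class, not something CsvHelper provides.
public class Person
{
    public string Name { get; set; }
    public string Gender { get; set; }
    public int Age { get; set; }
    public int NumKids { get; set; }
    public bool HasPet { get; set; }
    public string PetType { get; set; }
}

// Inside the while (csv.Read()) loop, collect the fields into an object
// instead of discarding them:
var person = new Person
{
    Name = name,
    Gender = gender,
    Age = age,
    NumKids = numKids,
    HasPet = hasPet,
    PetType = petType
};

// Example transform: the sample uses the literal string "null" when there
// is no pet, so normalize that to an actual null.
if (!person.HasPet)
    person.PetType = null;

people.Add(person); // people is a List<Person> declared before the loop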

Although a little late, I would still like to add an answer:
To create your own ETL process and data flow with C#, I would recommend the NuGet package ETLBox (https://etlbox.net). It will enable you to write an ETL data flow where the CSV reader implementation is already wrapped in a CSVSource object. For example, you would have to do the following to load data from a CSV into a database:
Define a CSV source:
CSVSource sourceOrderData = new CSVSource("demodata.csv");
Optionally define a row transformation:
RowTransformation<string[], Order> rowTrans = new RowTransformation<string[], Order>(
row => new Order(row)
);
Define the destination:
DBDestination<Order> dest = new DBDestination<Order>("dbo.OrderTable");
Link your ETL data pipeline together:
sourceOrderData.LinkTo(rowTrans);
rowTrans.LinkTo(dest);
Finally, start the data flow (async) and wait for all data to be loaded:
sourceOrderData.Execute();
dest.Wait();
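The Order type above is assumed to be your own POCO with a constructor that accepts the raw string[] row coming out of the CSVSource; ETLBox does not provide it. A minimal sketch of what it could look like (the property names are hypothetical):
public class Order
{
    public string CustomerName { get; set; }
    public int Quantity { get; set; }

    // Maps the raw CSV row (column order assumed) onto typed properties.
    public Order(string[] row)
    {
        CustomerName = row[0];
        Quantity = int.Parse(row[1]);
    }
}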

Related

ZipArchive, update entry: read - truncate - write

I'm using System.IO.Compression's ZipArchive to modify a file within a ZIP. I first need to read the whole content (JSON), transform the JSON, then truncate the file and write the new JSON back to it. At the moment I have the following code:
using (var zip = new ZipArchive(new FileStream(zipFilePath, FileMode.Open, FileAccess.ReadWrite), ZipArchiveMode.Update))
{
    using var stream = zip.GetEntry(entryName).Open();
    using var reader = new StreamReader(stream);
    using var jsonTextReader = new JsonTextReader(reader);

    var json = JObject.Load(jsonTextReader);
    PerformModifications(json);

    stream.Seek(0, SeekOrigin.Begin);

    using var writer = new StreamWriter(stream);
    using var jsonTextWriter = new JsonTextWriter(writer);
    json.WriteTo(jsonTextWriter);
}
However, the problem is: if the resulting JSON is shorter than the original version, the remainder of the original is not truncated. Therefore I need to properly truncate the file before writing to it.
How do I truncate the entry before writing to it?
You can either delete the entry before writing it back or, which I prefer, use stream.SetLength(0) to truncate the stream before writing. (See also https://stackoverflow.com/a/46810781/62838.)
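Applied to the code in the question, a minimal sketch of the SetLength approach (only the write half changes):
var json = JObject.Load(jsonTextReader);
PerformModifications(json);

// Truncate the entry so no stale bytes remain when the new JSON is shorter.
stream.SetLength(0);

using var writer = new StreamWriter(stream);
using var jsonTextWriter = new JsonTextWriter(writer);
json.WriteTo(jsonTextWriter);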

How to get a named range from an Excel sheet using ExcelDataReader.DataSet

I am using the ExcelDataReader.DataSet package to read in .xlsx files and store specific sheets as a DataTable, like so:
public void SelectWorkbookClick(string s)
{
    string fileName = string.Format("{0}\\{1}", WorkbookFolder, Workbook);

    using (var stream = File.Open(fileName, FileMode.Open, FileAccess.Read))
    {
        using (var reader = ExcelReaderFactory.CreateReader(stream))
        {
            FrontSheet = reader.AsDataSet().Tables[FrontSheetName];
            RearSheet = reader.AsDataSet().Tables[RearSheetName];
        }
    }
}
This works perfectly for reading in sheets; however, my .xlsx file has named ranges which I need to access.
I have had a look around and cannot find any support for this. Does anyone know of any way I could go about this?

How to decompress .zip files in C# without extracting to a new location

How can I decompress (.zip) files without extracting them to a new location in the .NET Framework? Specifically, I'm trying to read a filename.csv.zip into a DataTable.
I'm aware of ExtractToDirectory (an extension method on ZipArchive), but I just want to extract into an object in C# and would like to avoid creating a new file.
I'm hoping to be able to do this without third-party libraries, but I'll take what I can get.
There may be some bugs because I never tested this, but here you go:
List<string> urmom = new List<string>();

using (ZipArchive archive = ZipFile.OpenRead(zipPath))
    foreach (ZipArchiveEntry entry in archive.Entries)
        using (StreamReader r = new StreamReader(entry.Open()))
            urmom.Add(r.ReadToEnd());
Basically you use ZipFile's static OpenRead method to open the archive and iterate through each entry. At that point you can use a StreamReader to read each entry. From there you can create a file from the stream and even read the file name if you want to. My code doesn't do this, a bit of laziness on my part.
Keep in mind that a compressed archive might contain multiple files. To handle this you need to iterate through all entries of the zip file in order to retrieve them and treat them separately.
The sample below converts a sequence of bytes into a list of strings, where each one is the content of a file included in the zipped folder:
public static IEnumerable<string> DecompressToEntriesTextContext(byte[] input)
{
    var zipEntriesContext = new List<string>();

    using (var compressedStream = new MemoryStream(input))
    using (var zip = new ZipArchive(compressedStream, ZipArchiveMode.Read))
    {
        foreach (var entry in zip.Entries)
        {
            using (var entryStream = entry.Open())
            using (var memoryEntryStream = new MemoryStream())
            using (var reader = new StreamReader(memoryEntryStream))
            {
                entryStream.CopyTo(memoryEntryStream);
                memoryEntryStream.Position = 0;
                zipEntriesContext.Add(reader.ReadToEnd());
            }
        }
    }

    return zipEntriesContext;
}
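As a rough usage sketch for the filename.csv.zip case from the question (parsing the CSV text into a DataTable is left out):
var zipBytes = File.ReadAllBytes("filename.csv.zip");

foreach (var entryText in DecompressToEntriesTextContext(zipBytes))
{
    // entryText holds the full text of one entry (e.g. the inner .csv),
    // ready to be parsed into a DataTable with whatever CSV parser you use.
    Console.WriteLine(entryText.Length);
}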

How can I save entire MongoDB collection to json/bson file using C#?

I have a process which first generates lots of data that is saved into a MongoDB collection; then the data is analyzed, and last I want to save the whole collection to a file on disk and erase the collection.
I know I could do it easily with MongoDump.exe, but I was wondering whether there is any way to do it directly from C#? I mean not by running a console process, but by using some functionality that is inside the Mongo C# driver.
And, if it can be done, how would I do the reverse operation in C#? Namely: loading a .bson file into a collection?
Here are two methods that you can use to accomplish this:
public static async Task WriteCollectionToFile(IMongoDatabase database, string collectionName, string fileName)
{
    var collection = database.GetCollection<RawBsonDocument>(collectionName);

    // Make sure the file is empty before we start writing to it
    File.WriteAllText(fileName, string.Empty);

    using (var cursor = await collection.FindAsync(new BsonDocument()))
    {
        while (await cursor.MoveNextAsync())
        {
            var batch = cursor.Current;
            foreach (var document in batch)
            {
                File.AppendAllLines(fileName, new[] { document.ToString() });
            }
        }
    }
}
public static async Task LoadCollectionFromFile(IMongoDatabase database, string collectionName, string fileName)
{
    using (FileStream fs = File.Open(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (BufferedStream bs = new BufferedStream(fs))
    using (StreamReader sr = new StreamReader(bs))
    {
        var collection = database.GetCollection<BsonDocument>(collectionName);

        string line;
        while ((line = sr.ReadLine()) != null)
        {
            await collection.InsertOneAsync(BsonDocument.Parse(line));
        }
    }
}
And here's an example of how you would use them:
// Obviously you'll need to change all these values to your environment
var connectionString = "mongodb://localhost:27017";
var database = new MongoClient(connectionString).GetDatabase("database");
var fileName = @"C:\mongo_output.txt";
var collectionName = "collection name";

// This will save all of the documents in the file you specified
WriteCollectionToFile(database, collectionName, fileName).Wait();

// This will drop all of the documents in the collection
database.GetCollection<BsonDocument>(collectionName).DeleteManyAsync(new BsonDocument()).Wait();

// This will restore all the documents from the file you specified
LoadCollectionFromFile(database, collectionName, fileName).Wait();
Note that this code was written using version 2.0 of the MongoDB C# driver, which you can obtain via NuGet. Also, the file reading code in the LoadCollectionFromFile method was obtained from this answer.
You can use the C# BinaryFormatter to serialize an object graph to disk, and deserialize it back into an object graph later.
Serialize:
https://msdn.microsoft.com/en-us/library/c5sbs8z9%28v=VS.110%29.aspx
Deserialize:
https://msdn.microsoft.com/en-us/library/b85344hz%28v=vs.110%29.aspx
However, that is not a MongoDB or C# driver feature.
After serializing you can use the driver to drop the collection, and after deserializing you can use the driver to insert the objects into a new collection.
Based on your rules, you may want to do some locking on that collection while you are doing the export process, before you drop it.
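As a rough sketch of that approach, assuming your own [Serializable] POCOs rather than raw driver documents, and with a placeholder type and file path:
[Serializable]
public class Measurement
{
    public string Name;
    public double Value;
}

// measurements would be filled from your collection beforehand.
var measurements = new List<Measurement>();

// Serialize the object graph to disk...
var formatter = new BinaryFormatter();
using (var fs = File.Create(@"C:\mongo_backup.bin")) // placeholder path
{
    formatter.Serialize(fs, measurements);
}

// ...and later deserialize it back.
using (var fs = File.OpenRead(@"C:\mongo_backup.bin"))
{
    var restored = (List<Measurement>)formatter.Deserialize(fs);
}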

Streaming files from Amazon S3 with seek possibility in C#

I need to work with huge files in Amazon S3. How can I get part of a huge file from S3? The best way would be to get a stream with seek capability.
Unfortunately, the CanSeek property of response.ResponseStream is false:
GetObjectRequest request = new GetObjectRequest();
request.BucketName = BUCKET_NAME;
request.Key = NumIdToAmazonKey(numID);
GetObjectResponse response = client.GetObject(request);
You could do the following to read a certain part of your file:
GetObjectRequest request = new GetObjectRequest
{
    BucketName = bucketName,
    Key = keyName,
    ByteRange = new ByteRange(0, 10)
};
See the documentation
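The response for a ranged request like that can then be read as a normal (forward-only) stream; a minimal sketch, reusing the request object above:
using (GetObjectResponse response = client.GetObject(request))
using (var reader = new StreamReader(response.ResponseStream))
{
    // Contains only the bytes covered by the ByteRange above (0-10 here).
    string partialContent = reader.ReadToEnd();
}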
I know this isn't exactly what the OP is asking for, but I needed a seekable S3 stream so I could read Parquet files without downloading them, so I gave it a shot here: https://github.com/mukunku/RandomHelpers/blob/master/SeekableS3Stream.cs
Performance wasn't as bad as I expected. You can use the TimeWastedSeeking property to see how much time is being wasted by allowing Seek() on an S3 stream.
Here's an example on how to use it:
using (var client = new AmazonS3Client(credentials, Amazon.RegionEndpoint.USEast1))
{
    using (var stream = SeekableS3Stream.OpenFile(client, "myBucket", "path/to/myfile.txt", true))
    {
        // stream is seekable!
    }
}
After a frustrating afternoon with the same problem, I found the static class AmazonS3Util:
https://docs.aws.amazon.com/sdkfornet/v3/apidocs/items/S3/TS3Util.html
which has a MakeStreamSeekable method.
Way late for the OP, but I've just posted an article and code demonstrating a SeekableS3Stream that performs reasonably well in real-world use cases.
https://github.com/mlhpdx/seekable-s3-stream
Specifically, I demonstrate reading a single small file from a much larger ISO disk image using the DiscUtils library (unmodified) by implementing a random-access stream that uses Range requests to pull sections of the file as needed, maintaining them in an MRU list to avoid re-downloading ranges for hot data structures in the file (e.g. zip central directory records).
The use is similarly simple:
using System;
using System.IO;
using System.Threading.Tasks;
using Amazon.S3;
using DiscUtils.Iso9660;

namespace Seekable_S3_Stream
{
    class Program
    {
        const string BUCKET = "rds.nsrl.nist.gov";
        const string KEY = "RDS/current/RDS_ios.iso"; // "RDS/current/RDS_modern.iso";
        const string FILENAME = "READ_ME.TXT";

        static async Task Main(string[] args)
        {
            var s3 = new AmazonS3Client();

            using var stream = new Cppl.Utilities.AWS.SeekableS3Stream(s3, BUCKET, KEY, 1 * 1024 * 1024, 4);
            using var iso = new CDReader(stream, true);
            using var file = iso.OpenFile(FILENAME, FileMode.Open, FileAccess.Read);
            using var reader = new StreamReader(file);

            var content = await reader.ReadToEndAsync();

            await Console.Out.WriteLineAsync($"{stream.TotalRead / (float)stream.Length * 100}% read, {stream.TotalLoaded / (float)stream.Length * 100}% loaded");
        }
    }
}
