CSV does not load to Google BigQuery - c#

I made a program that gets the schema of a table from a MySQL server, creates a BigQuery table based on that schema, and then inserts the data rows, which are saved in a CSV file in Google Cloud Storage.
Code for getting the schema from MySQL:
foreach (string table in listOfTableNames)
{
    MySqlCommand query = new MySqlCommand($"desc {table}", openedMySQLConnection);
    MySqlDataReader reader = query.ExecuteReader();
    DataTable dt = new DataTable(table);
    dt.Load(reader);
    reader.Close();
    object[][] result = dt.AsEnumerable().Select(x => x.ItemArray).ToArray();
    TableSchemas.Add(table, result);
}
Code that creates the Google BigQuery table from that schema, looped per table:
var schemaBuilder = new TableSchemaBuilder();
foreach (var column in dictionaryOfSchemas[myTablename])
{
    string columnType = column[1].ToString().ToLower();
    schemaBuilder.Add(
        column[0].ToString(),
        columnType.Contains("varchar") ? BigQueryDbType.String :
        columnType.Contains("int") ? BigQueryDbType.Int64 :
        columnType.Contains("decimal") ? BigQueryDbType.Float64 :
        columnType.Contains("timestamp") ? BigQueryDbType.DateTime :
        columnType.Contains("datetime") ? BigQueryDbType.DateTime :
        columnType.Contains("date") ? BigQueryDbType.Date :
        BigQueryDbType.String
    );
}
TableSchema schema = schemaBuilder.Build();
BigQueryTable newTable = bigquery.GetDataset(myDataset).CreateTable(myTablename, schema);
Code that loads the CSV into the created BigQuery table, looped per table:
bigquery.CreateLoadJob(
    $"gs://{myProjectIdString}/{myBucketNameString}/{myCSVFilename}",
    newTable.Reference,
    schema,
    new CreateLoadJobOptions()
    {
        SourceFormat = FileFormat.Csv,
        SkipLeadingRows = 1
    }
).PollUntilCompleted();
No error shows up after running the CreateLoadJob method.
The schema of the new table seems to be good and matches the CSV and the MySQL table.
Here's a sneak peek into the CSV file in Cloud Storage, with some data redacted.
But there's still no data in the table.
Am I doing something wrong? I'm still learning Google services so any help and insight would be appreciated. :)

There are a few things wrong here, but it's mostly to do with the data rather than with the code itself.
In terms of code: when a job has completed, it may still have completed with errors. You can call ThrowOnAnyError() to observe that, and you can get the detailed errors via job.Resource.Status.Errors. (I believe the first of those detailed errors is the one in job.Resource.Status.ErrorResult.)
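For example, a minimal sketch of that check, where job stands for the BigQueryJob returned by CreateLoadJob in the question (the variable name is mine):
BigQueryJob completedJob = job.PollUntilCompleted();

// Throws a GoogleApiException if the job completed with errors.
completedJob.ThrowOnAnyError();

// Or inspect the detailed errors yourself.
if (completedJob.Resource.Status.Errors != null)
{
    foreach (var error in completedJob.Resource.Status.Errors)
    {
        Console.WriteLine($"{error.Reason}: {error.Message}");
    }
}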
Once the correct storage URL is provided (which would be observed that way as well) you'll see errors like this, with the CSV file you provided:
Error while reading data, error message: CSV processing encountered too many errors, giving up. Rows: 1; errors: 1; max bad: 0; error percent: 0
Could not parse '06/12/2014' as DATE for field delivery_date (position 0) starting at location 95 with message 'Unable to parse'
At that point, the problem is in your CSV file. There are two issues here:
The date format is expected to be ISO-8601, e.g. "2014-06-12" rather than "06/12/2014"
The date/time format is expected to include seconds as well, e.g. "2014-05-12 12:37:00" rather than "05/12/2014 12:37"
Hopefully you're able to run a preprocessing job to fix the data in your CSV file.
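As an illustration only (not part of the original answer), a simple preprocessing pass in C# could rewrite those columns; the column positions, input formats and file names below are assumptions based on the schema shown further down:
// Reformat the date/datetime columns into what BigQuery expects.
// Assumes delivery_date is column 0 (MM/dd/yyyy), date_posted/date_created are
// columns 6 and 7 (MM/dd/yyyy HH:mm), and no field contains embedded commas.
// Requires System.IO, System.Linq and System.Globalization.
var lines = File.ReadAllLines("input.csv");
var output = new List<string> { lines[0] }; // keep the header row

foreach (var line in lines.Skip(1))
{
    var fields = line.Split(',');
    fields[0] = DateTime.ParseExact(fields[0], "MM/dd/yyyy", CultureInfo.InvariantCulture)
                        .ToString("yyyy-MM-dd");
    fields[6] = DateTime.ParseExact(fields[6], "MM/dd/yyyy HH:mm", CultureInfo.InvariantCulture)
                        .ToString("yyyy-MM-dd HH:mm:ss");
    fields[7] = DateTime.ParseExact(fields[7], "MM/dd/yyyy HH:mm", CultureInfo.InvariantCulture)
                        .ToString("yyyy-MM-dd HH:mm:ss");
    output.Add(string.Join(",", fields));
}

File.WriteAllLines("fixed.csv", output);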
This is assuming that the schema you've created is correct, of course - we can't tell that from your post, but here's the schema that worked for me:
var schema = new TableSchemaBuilder
{
    { "delivery_date", BigQueryDbType.Date },
    { "delivery_hour", BigQueryDbType.Int64 },
    { "participant_id", BigQueryDbType.String },
    { "resource_id", BigQueryDbType.String },
    { "type_id", BigQueryDbType.String },
    { "price", BigQueryDbType.Numeric },
    { "date_posted", BigQueryDbType.DateTime },
    { "date_created", BigQueryDbType.DateTime }
}.Build();

After fiddling with my variables and links, I figured out that the URI for an object in your bucket should be gs://<bucket_name>/<csv_filename>, not gs://<project_id>/<bucket_name>/<csv_filename>.
My code is running OK now and has successfully transferred the data.
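In other words, the load call from the question becomes (same variable names as above):
bigquery.CreateLoadJob(
    $"gs://{myBucketNameString}/{myCSVFilename}",   // no project ID in the gs:// URI
    newTable.Reference,
    schema,
    new CreateLoadJobOptions()
    {
        SourceFormat = FileFormat.Csv,
        SkipLeadingRows = 1
    }
).PollUntilCompleted();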

Related

Sylvan CSV Reader C# Check for Missing Column in CSV

@MarkPflug I have a requirement to read 12 columns out of 45-85 total columns. This is from multiple CSV files (in the hundreds). But here is the problem: a lot of the time, a column or two will be missing from some of the CSV data files. How do I check in C# for a missing column in a CSV file, given that I use the NuGet package Sylvan CSV reader? Here is some code:
// Create a reader
CsvDataReader reader = CsvDataReader.Create(file, new CsvDataReaderOptions { ResultSetMode = ResultSetMode.MultiResult });

// Get each column by name from the CSV. This is where the error occurs, but only in the files
// that have missing columns. I store these ordinals and then use them in GetString(ordinal).
reader.GetOrdinal("HomeTeam");
reader.GetOrdinal("AwayTeam");
reader.GetOrdinal("Referee");
reader.GetOrdinal("FTHG");
reader.GetOrdinal("FTAG");
reader.GetOrdinal("Division");
// There is more data here, but you get the point.

// Here I run the reader, and for each row I call my database write method.
while (await reader.ReadAsync())
{
    await AddEntry(idCounter.ToString(), idCounter.ToString(), attendance, referee, division, date, home_team, away_team, fthg, ftag, hthg, htag, ftr, htr);
}
I tried the following:
// This still causes it to go out of bounds.
if (reader.GetOrdinal("Division") < reader.FieldCount)
    // only if the ordinal exists, assign it to a temp variable
else
    // skip this column (set the data in the AddEntry method to "")
Looking at the source, it appears that GetOrdinal throws if the column name isn't found or is ambiguous. As such I expect you could do:
int blah1Ord = -1;
try { blah1Ord = reader.GetOrdinal("blah1"); } catch { }

int blah2Ord = -1;
try { blah2Ord = reader.GetOrdinal("blah2"); } catch { }

while (await reader.ReadAsync())
{
    var x = new Whatever();
    if (blah1Ord > -1) x.Blah1 = reader.GetString(blah1Ord);
    if (blah2Ord > -1) x.Blah2 = reader.GetString(blah2Ord);
}
And so on: you effectively probe whether each column exists (the ordinal remains -1 if it doesn't) and then use that to decide whether to read the column or not.
Incidentally, I've been dealing with CSVs with poor/misspelled/partial header names, and I've found myself getting the column schema and searching it for partials, like:
using var cdr = CsvDataReader.Create(sr);
var cs = await cdr.GetColumnSchemaAsync();
var sc = StringComparison.OrdinalIgnoreCase;
var blah1Ord = cs.FirstOrDefault(c => c.ColumnName.Contains("blah1", sc))?.ColumnOrdinal ?? -1;
I started using the Sylvan library and it is really powerful.
Not sure if this could help you, but if you use the DataBinder.Create<T> generic method with an entity, you can do the following to get the columns in your CSV file that do not map to any of the entity's properties:
var dataBinderOptions = new DataBinderOptions()
{
    // AllColumns is required for UnboundMemberException to be thrown
    BindingMode = DataBindingMode.AllColumns,
};

IDataBinder<TEntity> binder;
try
{
    binder = DataBinder.Create<TEntity>(dataReader, dataBinderOptions);
}
catch (UnboundMemberException ex)
{
    // Use ex.UnboundColumns to get the unmapped columns
    readResult.ValidationProblems.Add($"Unmapped columns: {String.Join(", ", ex.UnboundColumns)}");
    return;
}

Performance: Deserializing a JSON array or storing a const[] with data in C#?

I'm storing a huge amount of data in JSON format that needs to be committed to the DB.
This JSON data is deserialized into an array of a C# class.
The thing is, there is other seed data that's stored in a static readonly array of the same class and simply loaded/sent to the DB.
My question is which is better, assuming we have 20k records for an entity: having a JSON file and deserializing it, or having a huge static readonly array?
My objective is to have a single uniform method of storing data instead of different types of files for the same thing.
Edit: My question apparently wasn't well explained, so I'll use an example this time. I have two options to load data.
A JSON file:
[
  {
    "Id": "56d0bdbe-25be-4ea8-a422-dc302deee962",
    "Name": "C-1",
    "LegacyId": 1
  },
  {
    "Id": "7bf2e997-8a8b-43c9-ba08-1770cd3adb38",
    "Name": "C-2",
    "LegacyId": 2
  }
]
Each item is deserialized into a UserEntity, which is then stored in the DB.
However, there are other files being used for the same purpose of loading data into the DB, but in this case they are stored like this:
public static class UserSeedData
{
    public static readonly UserEntity[] UserData =
    {
        new UserEntity
        {
            Id = "56d0bdbe-25be-4ea8-a422-dc302deee962",
            Name = "C-1",
            LegacyId = 1
        },
        new UserEntity
        {
            Id = "7bf2e997-8a8b-43c9-ba08-1770cd3adb38",
            Name = "C-2",
            LegacyId = 2
        }
    };
}
My question is which is faster, or whether there's any difference at all, between loading and deserializing the JSON versus loading the static UserSeedData class, in order to insert the data into the DB.
I'm focusing on the deserialization part, not the committing to the DB. This is a legacy system which currently uses BOTH ways to load data; I'm simply trying to normalize the process by deciding which method is better.
In the end, the code to commit to the DB is either:
From JSON:
List<UserEntity> users = JsonConvert.DeserializeObject<List<UserEntity>>(File.ReadAllText("Users.json"));
foreach (var userData in users)
{
    _db.Add(userData);
}
Or from C# class:
foreach (var userData in UserSeedData.UserData)
{
    _db.Add(userData);
}
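For what it's worth, one way to settle this empirically is to time both paths with a Stopwatch; this is just my own illustration, reusing the names from the question (requires System.Diagnostics):
// Rough comparison of the two load paths.
var sw = Stopwatch.StartNew();
List<UserEntity> fromJson = JsonConvert.DeserializeObject<List<UserEntity>>(File.ReadAllText("Users.json"));
sw.Stop();
Console.WriteLine($"JSON deserialization: {sw.ElapsedMilliseconds} ms for {fromJson.Count} records");

sw.Restart();
UserEntity[] fromCode = UserSeedData.UserData; // already compiled into the assembly, so effectively free
sw.Stop();
Console.WriteLine($"Static array access: {sw.ElapsedMilliseconds} ms for {fromCode.Length} records");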

EF Core Migration - PK Violation when seeding data

I am working with an existing project that has a database with what appears to be manually created data, entered via SQL Server / SSMS.
Further into the project, someone else created a seed data / configuration file. This is where I was introduced to the solution; I have created a new migration file and found that I am getting an error:
PRIMARY KEY constraint 'PK_AnswerTypes'. Cannot insert duplicate key in object 'forms.AnswerTypes'. The duplicate key value is (1)
Looking through Azure Pipelines, this appears to have been an issue since the configuration file was created.
The configure code is
public void Configure(EntityTypeBuilder<FieldType> builder)
{
    if (builder == null)
    {
        throw new ArgumentNullException(nameof(builder));
    }

    builder.ToTable("FieldTypes", FormEngineSchemas.Forms);

    // TODO: convert to enum
    builder.HasData(
        new FieldType
        {
            FieldTypeId = 1,
            FieldTypes = "NUMBER"
        },
        new FieldType
        {
            FieldTypeId = 2,
            FieldTypes = "DROPDOWN"
        },
        new FieldType
        {
            FieldTypeId = 3,
            FieldTypes = "DATE"
        });
}
The Up script is:
protected override void Up(MigrationBuilder migrationBuilder)
{
    migrationBuilder.InsertData(
        schema: "forms",
        table: "AnswerTypes",
        columns: new[] { "AnswerTypeId", "AnswerTypes" },
        values: new object[,]
        {
            { 1, "Range" },
            { 2, "Length" },
            { 3, "regex" }
        });
}
I would be grateful if someone could advise me how to get past this, as I would rather not delete the existing data in the database - I don't want to risk orphaned records or failed deletes.
I have had a look around, and this is the closest that I can find to my issue:
https://github.com/dotnet/efcore/issues/12324
Looking at these guides, it looks like the seeding has been done correctly:
https://www.learnentityframeworkcore.com/migrations/seeding
https://learn.microsoft.com/en-us/archive/msdn-magazine/2018/august/data-points-deep-dive-into-ef-core-hasdata-seeding
So, the questions I have are:
If the database had been created and seeded from the beginning and all worked fine, would all subsequent migrations work OK and not attempt to seed the data again?
What is the best way to get around this issue?
Is there anything I might have missed or not considered?
Thanks
Simon
Since you do not want to lose data, you can consider using migrationBuilder.UpdateData() instead of migrationBuilder.InsertData(). The InsertData method adds new records to the DB, while UpdateData searches for and updates existing records in the DB.
Make sure you have EF Core 2.1 or later for this to work.
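For illustration, the Up method from the question rewritten with UpdateData might look like this (a sketch only, keeping the keys and values from the question's migration):
protected override void Up(MigrationBuilder migrationBuilder)
{
    // Update the rows that already exist instead of inserting duplicates of them.
    migrationBuilder.UpdateData(
        schema: "forms",
        table: "AnswerTypes",
        keyColumn: "AnswerTypeId",
        keyValue: 1,
        column: "AnswerTypes",
        value: "Range");

    migrationBuilder.UpdateData(
        schema: "forms",
        table: "AnswerTypes",
        keyColumn: "AnswerTypeId",
        keyValue: 2,
        column: "AnswerTypes",
        value: "Length");

    migrationBuilder.UpdateData(
        schema: "forms",
        table: "AnswerTypes",
        keyColumn: "AnswerTypeId",
        keyValue: 3,
        column: "AnswerTypes",
        value: "regex");
}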

MongoDB C# Driver how to update a collection of updated documents

I have a Mongo database with lots of documents. All of the documents have a field called "source" which contains the name of the document's origin. But a lot of old documents have this field set to the string "null" (because I didn't have a source back then). I want to select all those documents and fix the problem by replacing their source = "null" values with new values parsed from other fields of those same documents.
Here's what I'm doing to fix this:
public void FixAllSources() {
    Task.Run(async () => {
        var filter = Builders<BsonDocument>.Filter.And(new List<FilterDefinition<BsonDocument>>() {
            Builders<BsonDocument>.Filter.Exists("url"),
            Builders<BsonDocument>.Filter.Ne("url", BsonNull.Value),
            Builders<BsonDocument>.Filter.Eq("source", "null")
        });

        var result = await m_NewsCollection.FindAsync(filter);
        var list = result.ToList();
        list.ForEach(bson => {
            bson["source"] = Utils.ConvertUrlToSource(bson["url"].AsString);
        });
    });
}
As you can see, I'm fetching all those documents and replacing their source field with a new value. So now I've got a List of corrected BsonDocuments which I need to put back into the database.
Unfortunately, I'm stuck here. The problem is that all of the "Update" methods require filters, and I have no idea what filter I should pass. I just need to put back all those updated documents, and that's it.
I'd appreciate any help :)
P.S.
I've come up with an ugly solution like this:
list.ForEach(bson => {
    bson["source"] = Utils.ConvertUrlToSource(bson["url"].AsString);
    try {
        m_NewsCollection.UpdateOne(new BsonDocument("unixtime", bson["unixtime"]),
            new BsonDocument {{ "$set", bson }},
            new UpdateOptions { IsUpsert = true });
    }
    catch (Exception e) {
        WriteLine(e.StackTrace);
    }
});
It works for now, but I'd still like to know a better way in case I need to do something like this again, so I'm not posting my own solution as an answer.
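One common alternative (not from the original thread) is to match each update on the document's own _id and batch everything into a single bulk write. A minimal sketch, meant to run inside the async lambda from FixAllSources:
// Replace each modified document by matching on its _id, in one round trip.
var writes = new List<WriteModel<BsonDocument>>();
foreach (var bson in list)
{
    bson["source"] = Utils.ConvertUrlToSource(bson["url"].AsString);
    var idFilter = Builders<BsonDocument>.Filter.Eq("_id", bson["_id"]);
    writes.Add(new ReplaceOneModel<BsonDocument>(idFilter, bson));
}

if (writes.Count > 0)
{
    await m_NewsCollection.BulkWriteAsync(writes);
}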

Parse & Unity 3D : Update an existing row

Using the example code from the Unity Developer Guide | Parse
# https://www.parse.com/docs/unity_guide#objects-updating
// Create the object.
var gameScore = new ParseObject("GameScore")
{
    { "score", 1337 },
    { "playerName", "Sean Plott" },
    { "cheatMode", false },
    { "skills", new List<string> { "pwnage", "flying" } },
};

gameScore.SaveAsync().ContinueWith(t =>
{
    // Now let's update it with some new data. In this case, only cheatMode
    // and score will get sent to the cloud. playerName hasn't changed.
    gameScore["cheatMode"] = true;
It just adds a new row and leaves the original row unchanged.
I guess I'm thinking Parse would do something "SQL-like", such as UPDATE ... WHERE primaryKey = 123.
Searching for an answer, I found this code at
https://parse.com/questions/updating-a-field-without-retrieving-the-object-first, but there was no example in C#. All my attempts to port it to C# result in multiple syntax errors.
UnityScript:
// Create a pointer to an object of class Point with id dlkj83d
var Point = Parse.Object.extend("Point");
var point = new Point();
point.id = "dlkj83d";

// Set a new value on quantity
point.set("quantity", 6);

// Save
point.save(null, {
    success: function(point) {
        // Saved successfully.
    },
    error: function(point, error) {
        // The save failed.
        // error is a Parse.Error with an error code and description.
    }
});
Does Parse have some way to update a row that already exists using C#? And where is it in the docs? And how can their own example be so useless?
One of the posts related to my question stated "retrieve the object, then write it back with the changes", and I had not the faintest idea how to execute that objective (especially after the epic fail of the Parse documentation's example code).
Here is what I have been able to figure out and make work:
var query = new ParseQuery<ParseObject>("Tokens")
    .WhereEqualTo("objectId", "XC18riofu9");

query.FindAsync().ContinueWith(t =>
{
    var tokens = t.Result;
    IEnumerator<ParseObject> enumerator = tokens.GetEnumerator();
    enumerator.MoveNext();
    var token = enumerator.Current;
    token["power"] = 20;
    return token.SaveAsync();
}).Unwrap().ContinueWith(t =>
{
    // Everything is done!
    //Debug.Log("Token has been updated!");
});
The first part retrieves the object with the stated objectId, the second part sets the fields on the object, and the third part reports that all is well with the operation.
It's a monkey-see, monkey-do understanding at this point, given that I don't understand the finer points of the code.
The code can be tested by creating a class named "Tokens". In that class, create a tokenName field and a power field, and make a few rows with Fire, Water and Mud as the tokenNames. Replace the objectId in the .WhereEqualTo clause with a valid objectId or any other search parameters you like. Execute the code and observe the changes in the Parse Data Browser.
For extra credit create the class required to implement the example code from the Chaining Tasks Together section of Parse's Documentation.
https://www.parse.com/docs/unity_guide#tasks-chaining
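As a possible alternative to querying first, the Parse .NET SDK also has ParseObject.CreateWithoutData, which builds a stub for a known objectId so you can set fields and save without fetching; a hedged sketch (verify against your SDK version):
// Build a stub pointing at the existing object, set the field, and save.
// Assumes ParseObject.CreateWithoutData is available in your version of the Parse SDK.
var token = ParseObject.CreateWithoutData("Tokens", "XC18riofu9");
token["power"] = 20;
token.SaveAsync().ContinueWith(t =>
{
    // Debug.Log("Token has been updated!");
});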
