I am encountering an error when inserting bulk data with the upsert function and cannot figure out how to fix it. Does anyone know what is wrong here? The program essentially grabs data from a SQL Server database and loads it into our Couchbase bucket on an Amazon instance. It does begin loading, but after about 10 or so upserts it crashes.
My error is as follows:
Collection was modified; enumeration operation may not execute.
Here is a screenshot of the error (sorry, the error is only reproduced on my other Amazon server instance and not locally):
http://imgur.com/a/ZJB0c
Here is the function that calls the upsert method. It is called multiple times, since I'm retrieving only part of the data at a time because the SQL table is very large.
private void receiptItemInsert(double i, int k)
{
    const int BATCH_SIZE = 10000;
    APSSEntities entity = new APSSEntities();
    var data = entity.ReceiptItems.OrderBy(x => x.ID).Skip((int)i * BATCH_SIZE).Take(BATCH_SIZE);
    var joinedData = from d in data
                     join s in entity.Stocks
                         on new { stkId = (Guid)d.StockID } equals new { stkId = s.ID } into ps
                     from s in ps.DefaultIfEmpty()
                     select new { d, s };
    var stuff = joinedData.ToList();
    var dict = new Dictionary<string, dynamic>();
    foreach (var ri in stuff)
    {
        Stock stock = new Stock();
        var ritem = new CouchModel.ReceiptItem(ri.d, k, ri.s);
        string key = "receipt_item:" + k.ToString() + ":" + ri.d.ID.ToString();
        dict.Add(key, ritem);
    }
    entity.Dispose();
    using (var cluster = new Cluster(config))
    {
        // open buckets here
        using (var bucket = cluster.OpenBucket("myhoney"))
        {
            bucket.Upsert(dict); // CRASHES HERE
        }
    }
}
As discussed in the Couchbase Forums, this is probably a bug in the SDK.
When initializing the internal map of the Couchbase cluster, the SDK constructs a List of endpoints. If two or more threads (as is the case during a bulk upsert) hit this code at the same time, one may observe the List while it is still being populated by the other, because the lock is entered only after a call to List.Any(), which can throw if the list is being modified concurrently.
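Until that is fixed in the SDK, a workaround I would try (this is only a sketch based on the explanation above, not a confirmed fix) is to force the cluster map to initialize with a single-document operation before the parallel bulk call, and to retry on that specific exception. The warm-up key and retry count below are assumptions, not values from the question:
using System;
using System.Collections.Generic;
using Couchbase.Core; // IBucket from the Couchbase .NET SDK 2.x, as used in the question

public static class BulkUpsertHelper
{
    public static void UpsertWithWarmup(IBucket bucket, Dictionary<string, dynamic> docs)
    {
        // Hypothetical warm-up document: any cheap single-key operation that forces
        // the cluster map / endpoint list to initialize on one thread would do.
        bucket.Upsert("warmup::doc", new { ts = DateTime.UtcNow });

        const int maxAttempts = 3; // assumed retry count
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                bucket.Upsert(docs); // the parallel bulk upsert that triggers the race
                return;
            }
            catch (InvalidOperationException) when (attempt < maxAttempts)
            {
                // "Collection was modified" race inside the SDK; back off briefly and retry.
                System.Threading.Thread.Sleep(100 * attempt);
            }
        }
    }
}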
I'm trying to replicate Gensim results in C# so I can compare them and decide whether we need to bother getting Python to work within our broader C# context. I have been programming in C# for about a week; I'm usually a Python coder. I managed to get LDA working and assigning topics in C#, but there is no Catalyst model (that I could find) that does Doc2Vec explicitly; instead I apparently need to do something with FastText, as in their sample code:
// Training a new FastText word2vec embedding model is as simple as this:
var nlp = await Pipeline.ForAsync(Language.English);
var ft = new FastText(Language.English, 0, "wiki-word2vec");
ft.Data.Type = FastText.ModelType.CBow;
ft.Data.Loss = FastText.LossType.NegativeSampling;
ft.Train(nlp.Process(GetDocs()));
await ft.StoreAsync();
The claim is that it is simple, and fair enough... but what do I do with this? I am using my own data, a list of IDocuments, each with a label attached:
using (var csv = CsvDataReader.Create("Jira_Export_Combined.csv", new CsvDataReaderOptions
{
    BufferSize = 0x20000
}))
{
    while (await csv.ReadAsync())
    {
        var a = csv.GetString(1);  // issue key
        var b = csv.GetString(14); // the actual bug
        // if (jira_base.Keys.ToList().Contains(a) == false)
        if (jira.Keys.ToList().Contains(a) == false)
        { // not already in our dictionary... too many repeats
            if (b.Contains("{panel"))
            {
                // get just the details/desc/etc
                b = b.Substring(b.IndexOf("}") + 1, b.Length - b.IndexOf("}") - 1);
                try { b = b.Substring(0, b.IndexOf("{panel}")); }
                catch { }
            }
            b = b.Replace("\r\n", "");
            jira.Add(a, nlp.ProcessSingle(new Document(b, Language.English)));
        } // end if
    } // end while loop
} // end using
That builds the dictionary from a set of Jira tasks; then I add labels:
foreach (KeyValuePair<string, IDocument> item in jira) { jira[item.Key].Labels.Add(item.Key); }
Then I add documents to a list, based on a breakdown from a topic model where I assign all docs at or above a threshold for a topic to jira_topics[n] (where n is the topic number), as such:
var training_lst = new List<IDocument>();
foreach (var doc in jira_topics[topic_num]) { training_lst.Add(jira[doc]); }
When I run the following code:
// FastText....
var ft = new FastText(Language.English, 0, $"vector-model-topic_{topic_num}");
ft.Data.Type = FastText.ModelType.Skipgram;
ft.Data.Loss = FastText.LossType.NegativeSampling;
ft.Train(training_lst);
var wtf = ft.PredictMax(training_lst[0]);
wtf is (null,NaN). [hence the name]
What am I missing? What else do I need to do to get Catalyst to vectorize my data? I want to grab the cosine similarities between the jira tasks and some other data I have, but I can't even get the Jira data into anything resembling a vectorization I can apply to something. Help!
Update:
So, Predict methods apparently only work for supervised learning in FastText (see comments below). And the following:
var wtf = ft.CompareDocuments(training_lst[0], training_lst[0]);
Throws an Implementation error (and only doesn't work with PVDM). How do I use PVDM, PVDCbow in Catalyst?
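For what it's worth, once you can get a document vector out of the trained model as a plain float array (how to do that in Catalyst is exactly the open question here), the cosine-similarity part itself is straightforward. A minimal sketch, assuming you already have two vectors of equal length:
using System;

public static class VectorMath
{
    // Cosine similarity between two dense vectors of equal length.
    // Returns a value in [-1, 1]; higher means more similar.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        if (normA == 0 || normB == 0)
            return 0; // define similarity against a zero vector as 0

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}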
Scenario:
We have a table with partitions, and we have to update the partition query (M expression) using .NET.
We use the code below to replace the existing M expression with a new one, but the changes are not applied at the database level (Analysis Services Tabular).
Are there any syntax errors or problems in this code?
TOA.Partition partition = m.Tables.Find(Table).Partitions[1];

OverrideCollection oc = new OverrideCollection
{
    Partitions =
    {
        new PartitionOverride
        {
            OriginalObject = partition,
            Source = new MPartitionSourceOverride
            {
                Expression = expressions
            }
        }
    }
};

var listOc = new List<OverrideCollection>();
listOc.Add(oc);

partition.RequestRefresh(TOA.RefreshType.Add, listOc);
// m.Tables[Table].Partitions[1].Refresh(TOA.RefreshType.Full, listOc); // it is not working
db.Update(UpdateOptions.ExpandFull);
db.Model.SaveChanges();
m.SaveChanges();

TOA.Partition partition1 = m.Tables.Find(Table).Partitions[1];
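As far as I can tell, an OverrideCollection passed to RequestRefresh only supplies out-of-line bindings for that single refresh operation; it does not persist the new expression in the model. If the goal is to permanently change the partition's M query, a more direct sketch with the Tabular Object Model would be to modify the partition's source and then save (m, Table, and expressions are the identifiers from the question; the cast assumes the partition really is an M partition):
using TOA = Microsoft.AnalysisServices.Tabular;

// Sketch: persist the new M expression on the partition itself instead of
// passing it only as a refresh override.
TOA.Partition partition = m.Tables.Find(Table).Partitions[1];

if (partition.Source is TOA.MPartitionSource mSource)
{
    mSource.Expression = expressions;           // replace the stored M query
}

partition.RequestRefresh(TOA.RefreshType.Full); // reload data with the new query
m.SaveChanges();                                // pushes the metadata change and runs the refresh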
I am wondering how to achieve more than 20,000 requests per second against Azure storage accounts. I understand that data needs to be spread across multiple storage accounts to go beyond the per-account limits, but I have been unable to achieve this with my current code. I have reached around 20,000 requests per second with an individual account, but performance doesn't improve (it usually decreases) when I add more storage accounts.
Some info on how the data is stored in the storage accounts and on the code behind it:
1. There is one table per storage account.
2. Each table is partitioned by the first three characters of the hash of its ID (I have played around with longer and shorter prefixes).
3. Each partition contains about 15 records.
4. Each storage account contains the exact same data (this is for test purposes).
5. There are currently 3 storage accounts.
6. Retrieving 50,000 records currently takes 2 minutes and 16 seconds.
7. I'm also using ServicePointManager with a default connection limit of 100 and Nagle turned off (see the sketch after this list).
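For reference, the ServicePointManager settings mentioned in item 7 would typically look something like this (a sketch; the limit of 100 and Nagle being off are from the description above):
using System.Net;

// Client-side tuning commonly used for high-throughput Azure Table Storage access.
ServicePointManager.DefaultConnectionLimit = 100;   // the value mentioned in item 7
ServicePointManager.UseNagleAlgorithm = false;      // "Nagle off"
ServicePointManager.Expect100Continue = false;      // not mentioned in the question, but commonly disabled as well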
Here is some sample code for a large query
public void retrievePartitionList<T1>(List<T1> entityList)
    where T1 : ITableEntity, new()
{
    int queryCountMax = 100; // needed at 100 not to exceed URI limits
    var partitionGroup = entityList.GroupBy(m => m.PartitionKey);
    List<TableQuery<T1>> queryList = new List<TableQuery<T1>>();
    List<Task<TableQuerySegment<T1>>> taskList = new List<Task<TableQuerySegment<T1>>>();

    // I have three storage accounts I'm retrieving from; ideally I want 20k+ throughput
    // for each storage account added. (tenantTable, the first account's table, is created elsewhere.)
    var cloudTable2 = getTableClientForTenant(2);
    var cloudTable3 = getTableClientForTenant(3);
    var tenTenTable2 = cloudTable2.GetTableReference(BATableStorageContainerPrefixes.tableStorage + 1.ToString());
    var tenTable3 = cloudTable3.GetTableReference(BATableStorageContainerPrefixes.tableStorage + 1.ToString());

    foreach (var partition in partitionGroup)
    {
        string rowFilters = "";
        var partitionList = partition.ToList();
        var partitionFilter = TableQuery.GenerateFilterCondition(TableConstants.PartitionKey, QueryComparisons.Equal, partition.Key);

        for (var i = 0; i < partitionList.Count; i++)
        {
            var item = partitionList[i];
            if (string.IsNullOrEmpty(rowFilters))
            {
                rowFilters = TableQuery.GenerateFilterCondition(TableConstants.RowKey, QueryComparisons.Equal, item.RowKey);
            }
            else
            {
                var newFilter = "(" + TableQuery.GenerateFilterCondition(TableConstants.RowKey, QueryComparisons.Equal, item.RowKey) + ")";
                rowFilters += " or " + newFilter;
            }

            if ((i + 1) % queryCountMax == 0 || i == partitionList.Count - 1)
            {
                rowFilters = TableQuery.CombineFilters(partitionFilter, TableOperators.And, rowFilters);
                TableQuery<T1> innerQuery = new TableQuery<T1>().Where(rowFilters);
                innerQuery.TakeCount = TableConstants.TableServiceMaxResults;
                queryList.Add(innerQuery);

                var random = new Random();
                // Randomly send the task to one of the storage accounts.
                // Once again, each storage account contains the same complete data set,
                // so no matter where the query goes it should return the correct results.
                var tenantTask = tenantTable.ExecuteQuerySegmentedAsync(innerQuery, null);
                var randomNum = random.Next(100);
                if (randomNum < 33)
                {
                    tenantTask = tenTenTable2.ExecuteQuerySegmentedAsync(innerQuery, null);
                    //Debug.WriteLine("second tenant");
                }
                else if (randomNum < 66)
                {
                    //Debug.WriteLine("first tenant");
                    tenantTask = tenTable3.ExecuteQuerySegmentedAsync(innerQuery, null);
                }
                taskList.Add(tenantTask);
                rowFilters = "";
            }
        }
    }

    List<T1> finalResults = new List<T1>();
    // I have messed around with parallelism and 8 is usually the best for the machine I'm on.
    Parallel.ForEach(taskList, new ParallelOptions { MaxDegreeOfParallelism = 8 }, task =>
    {
        var results = Task.WhenAll(task).Result;
        lock (finalResults)
        {
            foreach (var item in results)
            {
                finalResults.AddRange(item);
            }
        }
    });

    Debug.WriteLine(finalResults.Count()); // just to show the count of results received
}
So what I'm looking for is something that will add roughly 20,000 requests per second of throughput for each storage account added. I have tried running this on an S2 Azure web app with 10 instances but got poor results: about 2 minutes and 16 seconds for 50,000 records, even when all the partition keys and row keys are known up front.
EDIT
To further explain the situation: the table entity being inserted is rather small. It only has:
a row key that is an int
a partition key that is a 3-character hash of that int
one property that is always the same: a 10-digit int
1) You have not specified what the rest of the architecture is (# of clients, connection between the clients and the storage accounts). It’s highly likely you are running into other bottlenecks.
2) You have not specified a reason or a target for what are you trying to accomplish. Are you trying to get to 21K RPS or 2 million RPS?
3) Multiple storage accounts can end up on the same storage cluster, and eventually the storage cluster(s) involved will top out (but I expect the issue is more likely #1: a single storage cluster has around a thousand nodes, and you are probably not using a thousand clients).
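On the client side, one pattern worth trying before adding more accounts is to distribute the queries deterministically (round-robin) across the tables instead of creating a new Random each iteration, and to await all the segmented queries at once rather than blocking threads with Parallel.ForEach. A rough sketch reusing the queryList and table references from the question (the helper method itself is illustrative; the only SDK calls used are the ones the question already uses):
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Table; // same table SDK namespace the question's code appears to use

public static class TableFanOut
{
    // Fan the prepared queries out round-robin across the available tables
    // (e.g. tenantTable, tenTenTable2, tenTable3 from the question) and await them all.
    public static async Task<List<T1>> RunQueriesAsync<T1>(
        IReadOnlyList<CloudTable> tables,
        IReadOnlyList<TableQuery<T1>> queryList)
        where T1 : ITableEntity, new()
    {
        var tasks = new List<Task<TableQuerySegment<T1>>>(queryList.Count);

        for (int i = 0; i < queryList.Count; i++)
        {
            var table = tables[i % tables.Count]; // deterministic round-robin
            tasks.Add(table.ExecuteQuerySegmentedAsync(queryList[i], null));
        }

        var segments = await Task.WhenAll(tasks); // no blocked threads, no lock needed

        // Note: this ignores continuation tokens, which is only safe if each
        // query's results fit in a single segment (as in the question's batches of 100).
        return segments.SelectMany(segment => segment).ToList();
    }
}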
I have a query that looks like this:
using (MyDC TheDC = new MyDC())
{
    foreach (MyObject TheObject in TheListOfMyObjects)
    {
        DBTable TheTable = new DBTable();
        TheTable.Prop1 = TheObject.Prop1;
        .....
        TheDC.DBTables.InsertOnSubmit(TheTable);
    }
    TheDC.SubmitChanges();
}
This code basically inserts a list into the database using LINQ to SQL. Now, I've read online that L2S does NOT support bulk operations.
Does my query work by inserting each element at a time or all of them in one write?
Thanks for the clarification.
I modified the code from the following link to be more efficient and used it in my application. It is quite convenient because you can just put it in a partial class on top of your current autogenerated class. Instead of InsertOnSubmit, add entities to a list, and instead of SubmitChanges, call YourDataContext.BulkInsertAll(list).
http://www.codeproject.com/Tips/297582/Using-bulk-insert-with-your-linq-to-sql-datacontex
partial void OnCreated()
{
    CommandTimeout = 5 * 60;
}

public void BulkInsertAll<T>(IEnumerable<T> entities)
{
    using (var conn = new SqlConnection(Connection.ConnectionString))
    {
        conn.Open();

        Type t = typeof(T);
        var tableAttribute = (TableAttribute)t.GetCustomAttributes(
            typeof(TableAttribute), false).Single();
        var bulkCopy = new SqlBulkCopy(conn)
        {
            DestinationTableName = tableAttribute.Name
        };

        // Only map scalar columns; association (foreign key) properties are filtered out below.
        var properties = t.GetProperties().Where(EventTypeFilter).ToArray();
        var table = new DataTable();

        foreach (var property in properties)
        {
            Type propertyType = property.PropertyType;
            if (propertyType.IsGenericType &&
                propertyType.GetGenericTypeDefinition() == typeof(Nullable<>))
            {
                propertyType = Nullable.GetUnderlyingType(propertyType);
            }
            table.Columns.Add(new DataColumn(property.Name, propertyType));
        }

        foreach (var entity in entities)
        {
            table.Rows.Add(
                properties.Select(
                    property => property.GetValue(entity, null) ?? DBNull.Value
                ).ToArray());
        }

        bulkCopy.WriteToServer(table);
    }
}

private bool EventTypeFilter(System.Reflection.PropertyInfo p)
{
    var attribute = Attribute.GetCustomAttribute(p,
        typeof(AssociationAttribute)) as AssociationAttribute;

    if (attribute == null) return true;
    if (attribute.IsForeignKey == false) return true;

    return false;
}
The term Bulk Insert usually refers to the SQL Server specific ultra fast bcp based SqlBulkCopy implementation. It is built on top of IRowsetFastLoad.
Linq-2-SQL does not implement insert using this mechanism, under any conditions.
If you need to bulk load data into SQL Server and need it to be fast, I would recommend hand coding using SqlBulkCopy.
Linq-2-SQL will attempt to perform some optimisations to speed up multiple inserts; however, it will still fall short of many micro ORMs (even though no micro ORM I know of implements SqlBulkCopy).
It will generate a single insert statement for every record, but will send them all to the server in a single batch and run in a single transaction.
That is what the SubmitChanges() outside the loop does.
If you moved it inside, then every iteration through the loop would go off to the server for the INSERT and run in its own transaction.
I don't believe there is any way to fire off a SQL BULK INSERT.
LINQ Single Insert from List:
int i = 0;
foreach (IPAPM_SRVC_NTTN_NODE_MAP item in ipapmList)
{
    ++i;
    if (i % 50 == 0)
    {
        ipdb.Dispose();
        ipdb = null;
        ipdb = new IPDB();

        // .NET CORE
        //ipdb.ChangeTracker.AutoDetectChangesEnabled = false;
        ipdb.Configuration.AutoDetectChangesEnabled = false;
    }
    ipdb.IPAPM_SRVC_NTTN_NODE_MAP.Add(item);
    ipdb.SaveChanges();
}
I would suggest you take a look at N.EntityFramework.Extensions. It is a basic bulk extension framework for EF 6 that is available on NuGet; the source code is available on GitHub under the MIT license.
Install-Package N.EntityFramework.Extensions
https://www.nuget.org/packages/N.EntityFramework.Extensions
Once you install it, you can simply use the BulkInsert() method directly on the DbContext instance. It supports BulkDelete, BulkInsert, BulkMerge, and more.
BulkInsert()
var dbcontext = new MyDbContext();
var orders = new List<Order>();

for (int i = 0; i < 10000; i++)
{
    orders.Add(new Order { OrderDate = DateTime.UtcNow, TotalPrice = 2.99 });
}

dbcontext.BulkInsert(orders);
I'm using Rob Conery's Massive to connect to my database, but I don't seem to be able to save a list of dynamic objects to the database. I thought this was supported, though.
Here's the code I am attempting to use:
int numberOfChildren = int.Parse(Request.Form["numberOfChildren"]);
List<dynamic> children = new List<dynamic>();

for (int i = 1; i <= numberOfChildren; i++)
{
    dynamic child = new ExpandoObject();
    child.FamilyID = familyId;
    child.Type = "CHILD";
    child.LastName = Request.Form[i + "-childLastName"];
    child.FirstName = Request.Form[i + "-childFirstName"];
    child.SendSmsAlerts = false;
    child.Gender = Request.Form[i + "-childGender"];
    child.Birthdate = Request.Form[i + "-childBirthdate"];
    children.Add(child);
}

var people = new People();
people.Save(children);
I get a "Parameter count mismatch." error on line 78 of Massive.cs
Everything works fine if i only pass in a single dynamic object at a time, the error is only raised when I attempt to pass in the list. Based on the documentation on GitHub I thought this was supported and it would save all the children in one transaction.
Save takes a params array, not a list:
people.Save(children.ToArray());