C# MongoDB Driver OutOfMemoryException

I am trying to read data from a remote MongoDB instance from a C# console application, but I keep getting an OutOfMemoryException. The collection I am trying to read from has about 500,000 records. Does anyone see any issue with the code below?
var mongoCred = MongoCredential.CreateMongoCRCredential("xdb", "x", "x");
var mongoClientSettings = new MongoClientSettings
{
    Credentials = new[] { mongoCred },
    Server = new MongoServerAddress("x-x.mongolab.com", 12345),
};
var mongoClient = new MongoClient(mongoClientSettings);
var mongoDb = mongoClient.GetDatabase("xdb");
var mongoCol = mongoDb.GetCollection<BsonDocument>("Persons");
var list = await mongoCol.Find(new BsonDocument()).ToListAsync();

This is a simple workaround: you can page your results using .Limit(int?) and .Skip(int?). First store the number of documents in your collection in totNum, using
coll.Count(new BsonDocument()) /* use the same filter you will apply in the next Find() */
and then:
for (int _i = 0; _i < totNum / 1000 + 1; _i++)
{
    var result = coll.Find(new BsonDocument()).Limit(1000).Skip(_i * 1000).ToList();
    foreach (var item in result)
    {
        /* Write your document in CSV file */
    }
}
I hope this can help...
P.S.
I used 1000 in .Skip() and .Limit(), but obviously you can use whatever batch size you want :-)
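Another option (a minimal sketch, assuming the same mongoCol as in the question and the 2.x .NET driver) is to stream the results with a cursor instead of calling ToListAsync(), so only one server batch is held in memory at a time:
var filter = new BsonDocument();
var options = new FindOptions { BatchSize = 1000 }; // server returns ~1000 documents per batch
using (var cursor = await mongoCol.Find(filter, options).ToCursorAsync())
{
    while (await cursor.MoveNextAsync())
    {
        foreach (var doc in cursor.Current)
        {
            // process one document at a time (e.g. write it to your CSV file)
        }
    }
}
This keeps memory use roughly constant regardless of how many documents the collection holds.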


Delete document from MongoDB by _id range in C#

My Mongo DB (using Azure Cosmos DB) reached its max size of 20 GB. I didn't realize that, and now the app is not working. We were planning to delete old records (last 2 years). However, there is no date field in the documents. I was wondering whether _ts is maintained internally, but it looks like it is not. The only option, then, is to use the _id (which is an ObjectId). Can someone help with how to delete based on a date range using C#?
You can use the getTimestamp method.
In the shell, you can find a document's creation date like this:
db.c.find()[0]["_id"].getTimestamp()
The C# code will look similar to this:
var client = new MongoClient();
var d = client.GetDatabase("d");
var c = d.GetCollection<BsonDocument>("c");
c.InsertOne(new BsonDocument());
var result = c.AsQueryable().First()["_id"].AsObjectId;
Console.WriteLine(result.CreationTime);
I managed to make it work using this:
var startId = new ObjectId(startDateTime, 0, 0, 0);
var endId = new ObjectId(endDateTime, 0, 0, 0);
using (var cursor = await collection.Find(x => x["_id"] > startId && x["_id"] < endId).ToCursorAsync())
{
    while (await cursor.MoveNextAsync())
    {
        foreach (var doc in cursor.Current)
        {
            var creationTime = doc["_id"].AsObjectId.CreationTime;
            var filter = Builders<BsonDocument>.Filter.Eq("_id", doc["_id"].AsObjectId);
            try
            {
                var deleteResult = collection.DeleteMany(filter);
            }
            catch (Exception ex)
            {
            }
        }
    }
}
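A shorter alternative sketch (same collection, startId and endId as above, untested): because _id values are ordered by creation time, a single ranged DeleteMany can remove the whole window in one command instead of issuing one delete per document:
var rangeFilter = Builders<BsonDocument>.Filter.Gt("_id", startId) &
                  Builders<BsonDocument>.Filter.Lt("_id", endId);
var deleteResult = await collection.DeleteManyAsync(rangeFilter); // one server-side range delete
Console.WriteLine($"Deleted {deleteResult.DeletedCount} documents");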

C# MVC Loop through list and update each record efficiently

I have a list of 'Sites' stored in my database. The list is VERY big and contains around 50,000+ records.
I am trying to loop through each record and update it. This takes ages; is there a better, more efficient way of doing this?
using (IRISInSiteLiveEntities DB = new IRISInSiteLiveEntities())
{
    var allsites = DB.Sites.ToList();
    foreach (var sitedata in allsites)
    {
        var siterecord = DB.Sites.Find(sitedata.Id);
        siterecord.CabinOOB = "Test";
        siterecord.TowerOOB = "Test";
        siterecord.ManagedOOB = "Test";
        siterecord.IssueDescription = "Test";
        siterecord.TargetResolutionDate = "Test";
        DB.Entry(siterecord).State = EntityState.Modified;
    }
    DB.SaveChanges();
}
I have cut code out of the method above to get to the point. The full method pulls a list from Excel, matches those records against the sites list, and updates each matching record accordingly. The DB.Find is slowing the loop down dramatically.
[HttpPost]
public ActionResult UploadUpdateOOBList()
{
    CheckPermissions("UpdateOOBList");
    string[] typesallowed = new string[] { ".xls", ".xlsx" };
    HttpPostedFileBase file = Request.Files[0];
    var fname = file.FileName;
    if (!typesallowed.Any(fname.Contains))
    {
        return Json("NotAllowed");
    }
    file.SaveAs(Server.MapPath("~/Uploads/OOB List/") + fname);

    //Create empty OOB data list
    List<OOBList.OOBDetails> oob_data = new List<OOBList.OOBDetails>();

    //Using ClosedXML rather than Interop Excel....
    //Interop Excel: 30 seconds for 750 rows
    //ClosedXML: 3 seconds for 750 rows
    string fileName = Server.MapPath("~/Uploads/OOB List/") + fname;
    using (var excelWorkbook = new XLWorkbook(fileName))
    {
        var nonEmptyDataRows = excelWorkbook.Worksheet(2).RowsUsed();
        foreach (var dataRow in nonEmptyDataRows)
        {
            //for row number check
            if (dataRow.RowNumber() >= 4)
            {
                string siteno = dataRow.Cell(1).GetValue<string>();
                string sitename = dataRow.Cell(2).GetValue<string>();
                string description = dataRow.Cell(4).GetValue<string>();
                string cabinoob = dataRow.Cell(5).GetValue<string>();
                string toweroob = dataRow.Cell(6).GetValue<string>();
                string manageoob = dataRow.Cell(7).GetValue<string>();
                string resolutiondate = dataRow.Cell(8).GetValue<string>();
                string resolutiondate_converted = resolutiondate.Substring(resolutiondate.Length - 9);

                oob_data.Add(new OOBList.OOBDetails
                {
                    SiteNo = siteno,
                    SiteName = sitename,
                    Description = description,
                    CabinOOB = cabinoob,
                    TowerOOB = toweroob,
                    ManageOOB = manageoob,
                    TargetResolutionDate = resolutiondate_converted
                });
            }
        }
    }

    //Now delete file.
    System.IO.File.Delete(Server.MapPath("~/Uploads/OOB List/") + fname);

    Debug.Write("DOWNLOADING LIST ETC....\n");

    using (IRISInSiteLiveEntities DB = new IRISInSiteLiveEntities())
    {
        var allsites = DB.Sites.ToList();
        //Loop through sites and the OOB list and if they match then tell us
        foreach (var oobdata in oob_data)
        {
            foreach (var sitedata in allsites)
            {
                var indexof = sitedata.SiteName.IndexOf(' ');
                if (indexof > 0)
                {
                    var OOBNo = oobdata.SiteNo;
                    var OOBName = oobdata.SiteName;
                    var SiteNo = sitedata.SiteName;
                    var split = SiteNo.Substring(0, indexof);
                    if (OOBNo == split && SiteNo.Contains(OOBName))
                    {
                        var siterecord = DB.Sites.Find(sitedata.Id);
                        siterecord.CabinOOB = oobdata.CabinOOB;
                        siterecord.TowerOOB = oobdata.TowerOOB;
                        siterecord.ManagedOOB = oobdata.ManageOOB;
                        siterecord.IssueDescription = oobdata.Description;
                        siterecord.TargetResolutionDate = oobdata.TargetResolutionDate;
                        DB.Entry(siterecord).State = EntityState.Modified;
                        Debug.Write("Updated Site ID/Name Record: " + sitedata.Id + "/" + sitedata.SiteName);
                    }
                }
            }
        }
        DB.SaveChanges();
    }

    var nowdate = DateTime.Now.ToString("dd/MM/yyyy");
    System.IO.File.WriteAllText(Server.MapPath("~/Uploads/OOB List/lastupdated.txt"), nowdate);

    return Json("Success");
}
Looks like you are using Entity Framework (6 or Core). In either case both
var siterecord = DB.Sites.Find(sitedata.Id);
and
DB.Entry(siterecord).State = EntityState.Modified;
are redundant, because the sitedata variable is coming from
var allsites = DB.Sites.ToList();
This not only loads the whole Site table into memory, but the EF change tracker also keeps a reference to every object in that list. You can easily verify that with
var siterecord = DB.Sites.Find(sitedata.Id);
Debug.Assert(siterecord == sitedata);
The Find (when the data is already in memory) and Entry methods are themselves fast. The problem is that by default they trigger automatic DetectChanges, which leads to quadratic time complexity - in simple words, very slow.
With that being said, simply remove them:
if (OOBNo == split && SiteNo.Contains(OOBName))
{
    sitedata.CabinOOB = oobdata.CabinOOB;
    sitedata.TowerOOB = oobdata.TowerOOB;
    sitedata.ManagedOOB = oobdata.ManageOOB;
    sitedata.IssueDescription = oobdata.Description;
    sitedata.TargetResolutionDate = oobdata.TargetResolutionDate;
    Debug.Write("Updated Site ID/Name Record: " + sitedata.Id + "/" + sitedata.SiteName);
}
This way EF will detect changes just once (before SaveChanges) and will also update only the modified fields of each record.
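A side note beyond the answer above: if you ever do have to keep the explicit Entry(...).State = EntityState.Modified calls, EF6 also lets you switch off automatic change detection for the duration of the loop. A rough sketch under that assumption (note that marking the state explicitly updates all columns, not just the changed ones):
DB.Configuration.AutoDetectChangesEnabled = false; // EF6: stop Find/Entry from rescanning the tracker
try
{
    foreach (var sitedata in allsites)
    {
        // ... set the properties as above ...
        DB.Entry(sitedata).State = EntityState.Modified; // explicit, so no DetectChanges needed
    }
    DB.SaveChanges();
}
finally
{
    DB.Configuration.AutoDetectChangesEnabled = true;
}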
I have followed Ivan Stoev's suggestion and changed the code by removing the DB.Find and the EntityState.Modified calls. It now takes about a minute and a half compared to 15 minutes beforehand. Very surprising, as I didn't know that you don't actually need those to update the records. Clever. The code is now:
using (IRISInSiteLiveEntities DB = new IRISInSiteLiveEntities())
{
    var allsites = DB.Sites.ToList();
    Debug.Write("Starting Site Update loop...");
    //Loop through sites and the OOB list and if they match then tell us
    //750 records takes around 15-20 minutes.
    foreach (var oobdata in oob_data)
    {
        foreach (var sitedata in allsites)
        {
            var indexof = sitedata.SiteName.IndexOf(' ');
            if (indexof > 0)
            {
                var OOBNo = oobdata.SiteNo;
                var OOBName = oobdata.SiteName;
                var SiteNo = sitedata.SiteName;
                var split = SiteNo.Substring(0, indexof);
                if (OOBNo == split && SiteNo.Contains(OOBName))
                {
                    sitedata.CabinOOB = oobdata.CabinOOB;
                    sitedata.TowerOOB = oobdata.TowerOOB;
                    sitedata.ManagedOOB = oobdata.ManageOOB;
                    sitedata.IssueDescription = oobdata.Description;
                    sitedata.TargetResolutionDate = oobdata.TargetResolutionDate;
                    Debug.Write("Thank you, next: " + sitedata.Id + "\n");
                }
            }
        }
    }
    DB.SaveChanges();
}
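If the nested loop itself ever becomes the next bottleneck, here is a rough sketch (same variable names as above, untested): index the sites by their number prefix once, so each OOB row does a lookup instead of scanning all 50,000 sites.
// Build the lookup once, keyed by the part of SiteName before the first space
var sitesByNumber = allsites
    .Where(s => s.SiteName.IndexOf(' ') > 0)
    .ToLookup(s => s.SiteName.Substring(0, s.SiteName.IndexOf(' ')));
foreach (var oobdata in oob_data)
{
    foreach (var sitedata in sitesByNumber[oobdata.SiteNo]
                 .Where(s => s.SiteName.Contains(oobdata.SiteName)))
    {
        sitedata.CabinOOB = oobdata.CabinOOB;
        sitedata.TowerOOB = oobdata.TowerOOB;
        sitedata.ManagedOOB = oobdata.ManageOOB;
        sitedata.IssueDescription = oobdata.Description;
        sitedata.TargetResolutionDate = oobdata.TargetResolutionDate;
    }
}
DB.SaveChanges();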
So first of all you should turn your HttpPost action into an async method;
more info: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/async/
What you should then do is create the tasks and add them to a list, then wait for them to complete (if you want/need to) by calling Task.WaitAll():
https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.task.waitall?view=netframework-4.7.2
This will allow your code to run in parallel on multiple threads, which already improves performance quite a bit.
You can also use LINQ to reduce the size of allsites beforehand by doing something that will roughly look like this:
var sitedataWithCorrectNames = allsites.Where(x => /* evaluate your condition here */);
https://learn.microsoft.com/en-us/dotnet/framework/data/adonet/ef/language-reference/supported-and-unsupported-linq-methods-linq-to-entities
and then start your foreach (var oobdata) with the inner loop now being foreach (var sitedata in sitedataWithCorrectNames).
The same goes for SiteNo.Contains(OOBName):
https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/linq/getting-started-with-linq
P.S. Most DB SDKs also provide asynchronous functions, so use those as well.
P.P.S. I didn't have an IDE, so I eyeballed the code, but the links should provide you with plenty of samples. Reply if you need more help.
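To make the first suggestion concrete, here is a minimal sketch of the async shape (assuming ASP.NET MVC 5 and EF6, which provide async actions plus ToListAsync/SaveChangesAsync in System.Data.Entity):
[HttpPost]
public async Task<ActionResult> UploadUpdateOOBList()
{
    // ... read and parse the Excel file exactly as before ...
    using (var DB = new IRISInSiteLiveEntities())
    {
        var allsites = await DB.Sites.ToListAsync(); // requires using System.Data.Entity;
        // ... match and update the records as before ...
        await DB.SaveChangesAsync();
    }
    return Json("Success");
}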

MongoDB failed to insert all my records

I have just started using MongoDB because I am dealing with bulk data for my new project. I just set up the database, installed the C# driver for MongoDB, and here is what I tried:
public IHttpActionResult insertSample()
{
    var client = new MongoClient("mongodb://localhost:27017");
    var database = client.GetDatabase("reznext");
    var collection = database.GetCollection<BsonDocument>("sampledata");
    List<BsonDocument> batch = new List<BsonDocument>();
    for (int i = 0; i < 300000; i++)
    {
        batch.Add(
            new BsonDocument {
                { "field1", 1 },
                { "field2", 2 },
                { "field3", 3 },
                { "field4", 4 }
            });
    }
    collection.InsertManyAsync(batch);
    return Json("OK");
}
But when I check the collection for documents I see only 42k out of the 0.3 million records inserted. I use Robomongo as the client and would like to know what is wrong here. Is there an insertion limit per operation?
You call an async method and don't wait for the result. Either wait for it:
collection.InsertManyAsync(batch).Wait();
Or use the synchronous call:
collection.InsertMany(batch);
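If you want the action to report success or failure, here is a small sketch of the awaited version (assuming ASP.NET Web API 2, which supports async actions), so the insert completes, or throws, before the response is returned:
public async Task<IHttpActionResult> insertSample()
{
    var client = new MongoClient("mongodb://localhost:27017");
    var database = client.GetDatabase("reznext");
    var collection = database.GetCollection<BsonDocument>("sampledata");
    var batch = new List<BsonDocument>();
    for (int i = 0; i < 300000; i++)
    {
        batch.Add(new BsonDocument { { "field1", 1 }, { "field2", 2 }, { "field3", 3 }, { "field4", 4 } });
    }
    await collection.InsertManyAsync(batch); // any server error surfaces here
    return Json("OK");
}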

Taking too long to insert records into Elasticsearch in C#

The following code pulls data from two tables, table1 and table2, performs a JOIN on them over field3, and indexes the result into Elasticsearch. The total number of rows that need indexing is around 500 million. The code inserts 5 million records per hour, so at that rate it will take 100 hours to complete. Is there any way I can make it faster?
public static void selection()
{
    Uri node = new Uri("http://localhost:9200");
    ConnectionSettings settings = new ConnectionSettings(node);
    ElasticClient client = new ElasticClient(settings);
    int batchsize = 100;
    string query = "select table1.field1, table2.field2 from table1 JOIN table2 ON table1.field3=table2.field3";
    try
    {
        OracleCommand command = new OracleCommand(query, con);
        OracleDataReader reader = command.ExecuteReader();
        List<Record> l = new List<Record>(batchsize);
        string[] str = new string[2];
        int currentRow = 0;
        while (reader.Read())
        {
            for (int i = 0; i < 2; i++)
                str[i] = reader[i].ToString();
            l.Add(new Record(str[0], str[1]));
            if (++currentRow == batchsize)
            {
                Commit(l, client);
                l.Clear();
                currentRow = 0;
            }
        }
        Commit(l, client);
    }
    catch (Exception er)
    {
        Console.WriteLine(er.Message);
    }
}

public static void Commit(List<Record> l, ElasticClient client)
{
    BulkDescriptor a = new BulkDescriptor();
    foreach (var x in l)
        a.Index<Record>(op => op.Object(x).Index("index").Type("type"));
    var res = client.Bulk(d => a);
    Console.WriteLine("100 records more inserted.");
}
Any help is appreciated! :)
Can you try using the lower-level client, i.e. ElasticsearchClient?
Here is a sample example:
//Fill data in ElasticDataRows
StringBuilder ElasticDataRows = new StringBuilder();
ElasticDataRows.AppendLine("{ \"index\": { \"_index\": \"testindex\", \"_type\": \"Accounts\" }}");
ElasticDataRows.AppendLine(JsonConvert.SerializeXmlNode(objXML, Newtonsoft.Json.Formatting.None, true));
var node = new Uri(objExtSetting.SelectSingleNode("settings/ElasticSearchURL").InnerText);
var config = new ConnectionConfiguration(node);
ElasticsearchClient objElasticClient = new ElasticsearchClient(config);
//Insert data to ElasticSearch
var response = objElasticClient.Bulk(ElasticDataRows.ToString());
ElasticsearchClient is not strongly typed like NEST, so you can convert your class object data to JSON using Newtonsoft.Json.
As per my testing, this is faster than the NEST API.
Thanks,
Sameer
We have around 40-50 databases that we reindex each month. Each DB has from 1 to 8 million rows. The difference is that I take the data from MongoDB. What I do to make it faster is use Parallel.ForEach with 32 threads running and inserting into Elastic. I insert one record at a time because I need to calculate things for each of them, but you just take them from the DB and insert them into Elastic, so the bulk insert seems better. You could try to use 3-4 threads together with that bulk insert.
So split your table into 4, then start different threads that bulk insert into Elastic. From what I have seen, I'm pretty sure that reading from the DB takes the biggest part of the time. Also, I think you should try a batch size larger than 100.
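A rough sketch of that idea (reusing the Commit() method, reader and client from the question; the batch size and worker count are just starting points to tune): full batches are handed to a few workers through a bounded queue so reading from Oracle and indexing into Elasticsearch overlap.
// requires System.Collections.Concurrent and System.Threading.Tasks
int batchSize = 1000;
var batches = new BlockingCollection<List<Record>>(boundedCapacity: 8);
var workers = Enumerable.Range(0, 4).Select(_ => Task.Run(() =>
{
    foreach (var batch in batches.GetConsumingEnumerable())
        Commit(batch, client);           // each worker sends its own bulk requests
})).ToArray();
var current = new List<Record>(batchSize);
while (reader.Read())
{
    current.Add(new Record(reader[0].ToString(), reader[1].ToString()));
    if (current.Count == batchSize)
    {
        batches.Add(current);            // blocks if the workers fall behind
        current = new List<Record>(batchSize);
    }
}
if (current.Count > 0) batches.Add(current);
batches.CompleteAdding();
Task.WaitAll(workers);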

C# Lucene: get all the indexed terms

I am working on a Windows application using Lucene. I want to get all the indexed keywords and use them as the source for auto-suggest on the search field. How can I retrieve all the indexed keywords in Lucene? I am fairly new to C#. Code itself is appreciated. Thanks.
Are you looking to extract all terms from the index?
private void GetIndexTerms(string indexFolder)
{
    List<string> termlist = new List<string>();
    using (IndexReader reader = IndexReader.Open(FSDirectory.Open(new DirectoryInfo(indexFolder)), true))
    using (TermEnum terms = reader.Terms())
    {
        while (terms.Next())
        {
            Term term = terms.Term;
            string termText = term.Text;
            int frequency = reader.DocFreq(term);
            termlist.Add(termText);
        }
    }
}
For inspiration with Apache Lucene.NET version 4.8, you can look at GitHub msigut/LuceneNet48Demo. Use the SearcherManager, *QueryParser, and IndexWriter classes to build and query the index.
// your favorite query parser (MultiFieldQueryParser, for example)
_queryParser = new MultiFieldQueryParser(...

// Execute the search with a fresh IndexSearcher
_searchManager.MaybeRefreshBlocking();
var searcher = _searchManager.Acquire();
try
{
    var q = _queryParser.Parse(query);
    var topDocs = searcher.Search(q, 10);
    foreach (var scoreDoc in topDocs.ScoreDocs)
    {
        var document = searcher.Doc(scoreDoc.Doc);
        var hit = new QueryHit
        {
            Title = document.GetField("title")?.GetStringValue(),
            // ... your logic to read data from the index ...
        };
    }
}
finally
{
    _searchManager.Release(searcher);
    searcher = null;
}
