I'm generating a report from largish (2 million+ record) data in a MongoDB instance using the C# MongoDB driver. Getting all the records and processing them server-side is slow, so I've been trying different things.
The input is a List of arbitrary length; the code then has to query a largish (2 million record) collection for records that contain the input Guids.
INPUT: {A, B, C}
Dataset: {1-A, 2-A, 3-A, 4-C, 5-B, 6-C, 7-Z, 8-B .... 1000-Z}
A - 1-A, 2-A, 3-A = Count = 3
B - 5-B, 8-B = Count = 2
C - 4-C, 6-C = Count = 2
And then I need to return the set of matched records in the dataset.
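For reference, the matching and counting above boils down to a group-by over the matched records. A minimal in-memory sketch (the `Record` shape and names here are illustrative, not the real `Instance` model):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the real documents: each record carries
// the Guid it should be matched on.
public record Record(int Id, Guid Tag);

public static class MatchCounter
{
    // Returns the matched records grouped by input Guid, which also
    // gives the per-Guid counts (A = 3, B = 2, C = 2 in the example).
    public static Dictionary<Guid, List<Record>> Match(
        IEnumerable<Guid> inputs, IEnumerable<Record> dataset)
    {
        var wanted = new HashSet<Guid>(inputs);
        return dataset
            .Where(r => wanted.Contains(r.Tag))
            .GroupBy(r => r.Tag)
            .ToDictionary(g => g.Key, g => g.ToList());
    }
}
```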
The logic is sound, and I've implemented it as a LINQ query which runs pretty well, but at just over 30s it's too slow to sit at the end of an API call, so I'm trying to optimise it.
It seems that MongoDB is actually pretty quick at returning data, so I thought I would divide the Guids into sets of x length and parallelise the routine:
var results = new List<Instance>();
int counter = 0; int chunksize = 50;
Parallel.For(0, (inputs.Count() / chunksize) + 1, x =>
{
    var cx = inputs.Skip(chunksize * counter).Take(chunksize);
    foreach (var c in cx)
    {
        checkCounter++;
        $"Processing {c}".Dump();
        var instances = _db.Get<Instance>().Where(_Predicate_);
        if (instances.Any())
        {
            results.AddRange(instances);
            $"Total Instances is now: {results.Count()}".Dump();
        }
    }
});
It seems counter-intuitive (to me as a long-time SQL user), but I think it's got legs. The problem is that when the code runs, multiple threads seem to grab the same Guids from the list here:
var cx = inputs.Skip(chunksize * counter).Take(chunksize);
and of course I need to ensure I'm giving each thread a unique set of Guids. Can I do that in a Parallel.For, or should I be looking at doing something more low-level, e.g. generating separate tasks?
Thanks for reading.
You should use x instead of counter in your loop:
var cx = inputs.Skip(chunksize * x).Take(chunksize);
Also, use one of the thread-safe collections (e.g. ConcurrentBag<T>) for results, or refactor your code so that you generate the batches first and then process them in parallel.
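Putting both points together, a minimal sketch might look like the following. `FakeQuery` is a stand-in I've invented for the real `_db.Get<Instance>().Where(...)` call, and `ConcurrentBag<T>` replaces the unsynchronized `List<Instance>`:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ChunkedFetch
{
    // Stand-in for the real MongoDB query: just echoes the ids back.
    static IEnumerable<int> FakeQuery(IEnumerable<int> ids) => ids.ToList();

    public static List<int> FetchAll(List<int> inputs, int chunkSize)
    {
        var results = new ConcurrentBag<int>();     // thread-safe sink
        int chunks = (inputs.Count + chunkSize - 1) / chunkSize;
        Parallel.For(0, chunks, x =>                // use x, not a shared counter
        {
            var cx = inputs.Skip(chunkSize * x).Take(chunkSize);
            foreach (var r in FakeQuery(cx))
                results.Add(r);
        });
        return results.ToList();
    }
}
```

Each iteration gets its own `x`, so the chunks are disjoint, and the bag absorbs concurrent writes safely.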
You should use Microsoft's Reactive Framework (aka Rx) - NuGet System.Reactive and add using System.Reactive.Linq; - then you can do this:
IObservable<List<Instance>> query =
    from x in Observable.Range(0, (inputs.Count() / chunksize) + 1)
    from c in inputs.Skip(chunksize * x).Take(chunksize).ToObservable()
    from i in Observable.Start(() => _db.Get<Instance>().Where(_Predicate_).ToList())
    select i;

IList<List<Instance>> data = await query.ToList();
List<Instance> results = data.SelectMany(x => x).ToList();
Note the use of x within from c in inputs.Skip(chunksize * x).Take(chunksize).ToObservable(). That's where your original code went wrong in using counter.
This code runs in parallel and will automatically build the final list without any concern about threading on List<Instance>.
The use of LINQ makes the code quite readable too.
I am connecting to a web service that gives me all prices for a day (without time info). Each of those price results has the id for a corresponding "batch run".
The "batch run" has a date+time stamp, but I have to make a separate call to get all the batch info for the day.
Hence, to get the actual time of each result, I need to combine the two API calls.
I'm using Reactive for this, but I can't reliably combine the two sets of data. I thought that CombineLatest would do it, but it doesn't seem to work as I thought (based on http://reactivex.io/documentation/operators/combinelatest.html, http://introtorx.com/Content/v1.0.10621.0/12_CombiningSequences.html#CombineLatest).
[TestMethod]
public async Task EvenMoreBasicCombineLatestTest()
{
    int batchStart = 100, batchCount = 10;

    //create 10 results with batch ids [100, 109]
    //the test uses lists just to make debugging easier
    var resultsWithBatchIdList = Enumerable.Range(batchStart, batchCount)
        .Select(id => new { BatchRunId = id, ResultValue = id * 10 })
        .ToList();
    var resultsWithBatchId = Observable.ToObservable(resultsWithBatchIdList);
    Assert.AreEqual(batchCount, await resultsWithBatchId.Count());

    //create 10 batches with ids [100, 109]
    var batchesList = Enumerable.Range(batchStart, batchCount)
        .Select(id => new
        {
            ThisId = id,
            BatchName = String.Concat("abcd", id)
        })
        .ToList();
    var batchesObservable = Observable.ToObservable(batchesList);
    Assert.AreEqual(batchCount, await batchesObservable.Count());

    //turn the batch set into a dictionary so we can look up each batch by its id
    var batchRunsByIdObservable = batchesObservable.ToDictionary(batch => batch.ThisId);

    //for each result, look up the corresponding batch id in the dictionary to join them together
    var resultsWithCorrespondingBatch =
        batchRunsByIdObservable
            .CombineLatest(resultsWithBatchId, (batchRunsById, result) =>
            {
                Assert.AreEqual(batchCount, batchRunsById.Count);
                var correspondingBatch = batchRunsById[result.BatchRunId];
                var priceResultAndSourceBatch = new
                {
                    Result = result,
                    SourceBatchRun = correspondingBatch
                };
                return priceResultAndSourceBatch;
            });

    Assert.AreEqual(batchCount, await resultsWithCorrespondingBatch.Count());
}
I would expect as each element of the 'results' observable comes through, it would get combined with each element of the batch-id dictionary observable (which only ever has one element). But instead, it looks like only the last element of the result list gets joined.
I have a more complex problem deriving from this but while trying to create a minimum repro, even this is giving me unexpected results. This happens with version 3.1.1, 4.0.0, 4.2.0, etc.
(Note that the sequences don't generally match up as in this artificial example, so I can't just Zip them.)
So how can I do this join? A stream of results that I want to look up more info via a Dictionary (which also is coming from an Observable)?
Also note that the goal is to return the IObservable (resultsWithCorrespondingBatch), so I can't just await the batchRunsByIdObservable.
Ok I think I figured it out. I wish either of the two marble diagrams in the documentation had been just slightly different -- it would have made a subtlety of CombineLatest much more obvious:
N------1---2---3---
L--z--a------bc----
R------1---2-223---
       a   a bcc
It's combine latest -- so depending on when items get emitted, it's possible to miss some tuples. What I should have done is SelectMany:
NO: .CombineLatest(resultsWithBatchId, (batchRunsById, result) =>
YES: .SelectMany(batchRunsById => resultsWithBatchId.Select(result =>
Note that the "join" order is important: A.SelectMany(B) vs B.SelectMany(A) -- if A has 1 item and B has 100 items, the latter would result in 100 calls to subscribe to A.
I'm not sure "External Source" is the correct phrasing, but essentially I have a view in my database that points to a table in a different database. Not always, but from time to time, I get an ORA-12537 Network Session: End of File exception. I'm using Entity Framework, so I tried breaking the work up so that instead of one massive query, it runs a handful of queries to generate the final result. But this has had mixed-to-no impact.
public List<SomeDataModel> GetDataFromList(List<string> SOME_LIST_OF_STRINGS)
{
    var retData = new List<SomeDataModel>();
    const int MAX_CHUNK_SIZE = 1000;
    var totalPages = (int)Math.Ceiling((decimal)SOME_LIST_OF_STRINGS.Count / MAX_CHUNK_SIZE);
    var pageList = new List<List<string>>();

    for (var i = 0; i < totalPages; i++)
    {
        var chunkItems = SOME_LIST_OF_STRINGS.Skip(i * MAX_CHUNK_SIZE).Take(MAX_CHUNK_SIZE).ToList();
        pageList.Add(chunkItems);
    }

    using (var context = new SOMEContext())
    {
        foreach (var pageChunk in pageList)
        {
            var result = (from r in context.SomeEntity
                          where pageChunk.Contains(r.SomeString)
                          select r).ToList();
            result.ForEach(x => retData.Add(mapper.Map<SomeDataModel>(x)));
        }
    }
    return retData;
}
I'm not sure if there's a different approach to dealing with this exception or not, or if breaking up the query has any desired effect. It's probably worth noting that SOME_LIST_OF_STRINGS is pretty large (about 21,000 on average), so totalPages usually sits around 22.
Sometimes, that error can be caused by an excessively large "IN" list in the SQL. For example:
SELECT *
FROM tbl
WHERE somecol IN ( ...huge list of stuff... );
Enabling application or database level tracing could help reveal whether the SQL that's being constructed behind the scenes has a large IN list.
A workaround might be to INSERT "...huge list of stuff..." into a table and then use something similar to the query below in order to avoid the huge list of literals.
SELECT *
FROM tbl
WHERE somecol IN ( select stuff from sometable );
Reference*:
https://support.oracle.com/knowledge/More%20Applications%20and%20Technologies/2226769_1.html
*I mostly drew my conclusions from the part of this reference that's not publicly viewable.
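If the list can't be moved into a table, the other common workaround is to keep each generated IN list at or under Oracle's 1000-expression limit (exceeding it raises ORA-01795). A small helper to do the splitting might look like this; the class and method names are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class InListChunker
{
    // Oracle rejects IN lists with more than 1000 literals (ORA-01795),
    // so split the keys into pages no larger than that, then issue one
    // query per page.
    public static List<List<string>> Chunk(List<string> keys, int maxChunk = 1000)
    {
        var pages = new List<List<string>>();
        for (int i = 0; i < keys.Count; i += maxChunk)
            pages.Add(keys.Skip(i).Take(maxChunk).ToList());
        return pages;
    }
}
```

Note that this only helps if each query is actually built from its own page, not from the full list.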
Currently I am using the default configuration value of:
<setting name="ContentSearch.SearchMaxResults" value="500" />
I need, for a specific Solr (ContentSearch) query, to return all items of a specific Template ID. The total returned will be in excess of 1200 items.
I tried using the paging feature to override SearchMaxResults by invoking a query as follows:
var query = context.GetQueryable<SearchResultItem>().Filter(i => i["_template"].Equals(variantTemplateId));
query = query.Page(1, 1500);
var results = query.GetResults();
However, I still only receive a single page of 500 items as the 1500 Page Size won't override the SearchMaxResults value of 500.
I really don't want to increase SearchMaxResults for all queries, as it will have a negative impact on search overall. It would be ideal if I could set this parameter to "" (unlimited results) temporarily, run my query, and reset it back to the default, but I don't see a way to do this. I also cannot use GetDescendants() as a means of acquiring all these items, as it negatively impacts site performance, even if I only do it one time and store my results in Memory Cache.
Any direction would be greatly appreciated.
As you say, it's good to keep SearchMaxResults at a reasonably low number, such as 500. When you know you might need to fetch more data, you can perform several queries in a loop, for example like this:
int skip = 0;
const int chunkSize = 500;
bool fetchMore = true;

while (fetchMore)
{
    var q = context.GetQueryable<MyModel>()
        .Filter(....)
        ...
        .Skip(skip).Take(chunkSize)
        .Select(d => new { d.field1, d.field2, ... })
        .GetResults();

    var cnt = 0;
    foreach (var doc in q.Hits)
    {
        // do stuff
        cnt++;
    }

    skip += cnt;
    fetchMore = cnt == chunkSize;
}
As shown above, I've used the Select method to limit the fields returned. This sets the Solr fl parameter to return just the fields you need; otherwise fl=*,score will be used, which causes a lot of data to be sent over the network, and deserializing it can be quite heavy. (I have a separate post on this here: https://mikael.com/2019/01/optimize-sitecore-solr-queries/)
I know variants of this question have been asked before (even by me), but I still don't understand a thing or two about this...
It was my understanding that one could retrieve more documents than the 128 default setting by doing this:
session.Advanced.MaxNumberOfRequestsPerSession = int.MaxValue;
And I've learned that a where clause should be an expression tree (Expression<Func<T, bool>>) instead of a Func, so that it's treated as IQueryable instead of IEnumerable. So I thought this should work:
public static List<T> GetObjectList<T>(Expression<Func<T, bool>> whereClause)
{
    using (IDocumentSession session = GetRavenSession())
    {
        return session.Query<T>().Where(whereClause).ToList();
    }
}
However, that only returns 128 documents. Why?
Note, here is the code that calls the above method:
RavenDataAccessComponent.GetObjectList<Ccm>(x => x.TimeStamp > lastReadTime);
If I add Take(n), then I can get as many documents as I like. For example, this returns 200 documents:
return session.Query<T>().Where(whereClause).Take(200).ToList();
Based on all of this, it would seem that the appropriate way to retrieve thousands of documents is to set MaxNumberOfRequestsPerSession and use Take() in the query. Is that right? If not, how should it be done?
For my app, I need to retrieve thousands of documents (that have very little data in them). We keep these documents in memory and used as the data source for charts.
** EDIT **
I tried using int.MaxValue in my Take():
return session.Query<T>().Where(whereClause).Take(int.MaxValue).ToList();
And that returns 1024. Argh. How do I get more than 1024?
** EDIT 2 - Sample document showing data **
{
    "Header_ID": 3525880,
    "Sub_ID": "120403261139",
    "TimeStamp": "2012-04-05T15:14:13.9870000",
    "Equipment_ID": "PBG11A-CCM",
    "AverageAbsorber1": "284.451",
    "AverageAbsorber2": "108.442",
    "AverageAbsorber3": "886.523",
    "AverageAbsorber4": "176.773"
}
It is worth noting that since version 2.5, RavenDB has had an "unbounded results API" to allow streaming. The example from the docs shows how to use it:
var query = session.Query<User>("Users/ByActive").Where(x => x.Active);

using (var enumerator = session.Advanced.Stream(query))
{
    while (enumerator.MoveNext())
    {
        User activeUser = enumerator.Current.Document;
    }
}
There is support for standard RavenDB queries and Lucene queries, and there is also async support.
The documentation can be found here. Ayende's introductory blog article can be found here.
The Take(n) function will only give you up to 1024 by default. However, you can change this default in Raven.Server.exe.config:
<add key="Raven/MaxPageSize" value="5000"/>
For more info, see: http://ravendb.net/docs/intro/safe-by-default
The Take(n) function will only give you up to 1024 by default. However, you can use it together with Skip(n) to get everything:
var points = new List<T>();
var nextGroupOfPoints = new List<T>();
const int ElementTakeCount = 1024;
int i = 0;
int skipResults = 0;
RavenQueryStatistics stats;

do
{
    nextGroupOfPoints = session.Query<T>()
        .Statistics(out stats)
        .Where(whereClause)
        .Skip(i * ElementTakeCount + skipResults)
        .Take(ElementTakeCount)
        .ToList();
    i++;
    skipResults += stats.SkippedResults;
    points = points.Concat(nextGroupOfPoints).ToList();
}
while (nextGroupOfPoints.Count == ElementTakeCount);

return points;
RavenDB Paging
The number of requests per session is a separate concept from the number of documents retrieved per call. Sessions are short-lived and are expected to have few calls issued over them.
If you are getting more than 10 of anything from the store (even fewer than the default 128) for human consumption, then something is wrong, or your problem requires different thinking than hauling a truckload of documents from the data store.
RavenDB indexing is quite sophisticated. Good article about indexing here and facets here.
If you need to perform data aggregation, create a map/reduce index which produces the aggregated data, e.g.:
Index:
// Map
from post in docs.Posts
select new { post.Author, Count = 1 }

// Reduce
from result in results
group result by result.Author into g
select new
{
    Author = g.Key,
    Count = g.Sum(x => x.Count)
}
Query:
session.Query<AuthorPostStats>("Posts/ByUser/Count")
    .Where(x => x.Author == "...")
    .ToList();
You can also use a predefined index with the Stream method. You may use a Where clause on indexed fields.
var query = session.Query<User, MyUserIndex>();
var query = session.Query<User, MyUserIndex>().Where(x => !x.IsDeleted);

using (var enumerator = session.Advanced.Stream<User>(query))
{
    while (enumerator.MoveNext())
    {
        var user = enumerator.Current.Document;
        // do something
    }
}
Example index:
public class MyUserIndex : AbstractIndexCreationTask<User>
{
    public MyUserIndex()
    {
        this.Map = users =>
            from u in users
            select new
            {
                u.IsDeleted,
                u.Username,
            };
    }
}
Documentation: What are indexes?
Session : Querying : How to stream query results?
Important note: the Stream method will NOT track objects. If you change objects obtained from this method, SaveChanges() will not be aware of any change.
Other note: you may get the following exception if you do not specify the index to use.
InvalidOperationException: StreamQuery does not support querying dynamic indexes. It is designed to be used with large data-sets and is unlikely to return all data-set after 15 sec of indexing, like Query() does.
Iterating through a DataTable that contains about 40,000 records using a for loop takes almost 4 minutes. Inside the loop I'm just reading the value of a specific column of each row and concatenating it to a string.
I'm not opening any DB connections or anything; it's a function which receives a DataTable, iterates through it, and returns a string.
Is there any faster way of doing this?
Code goes here:
private string getListOfFileNames(DataTable listWithFileNames)
{
    string whereClause = "";
    if (listWithFileNames.Columns.Contains("Filename"))
    {
        whereClause = "where filename in (";
        for (int j = 0; j < listWithFileNames.Rows.Count; j++)
            whereClause += " '" + listWithFileNames.Rows[j]["Filename"].ToString() + "',";
    }
    whereClause = whereClause.Remove(whereClause.Length - 1, 1);
    whereClause += ")";
    return whereClause;
}
Are you using a StringBuilder to concat the strings rather than just regular string concatenation?
Are you pulling back any more columns from the database than you really need? If so, try not to. Only pull back the column(s) that you need.
Are you pulling back any more rows from the database than you really need? If so, try not to. Only pull back the row(s) that you need.
How much memory does the computer have? Is it maxing out when you run the program or getting close to it? Is the processor at the max much or at all? If you're using too much memory then you may need to do more streaming. This means not pulling the whole result set into memory (i.e. a datatable) but reading each line one at a time. It also might mean that rather than concatting the results into a string (or StringBuilder ) that you might need to be appending them to a file so as to not take up so much memory.
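To make the first point concrete: building the same IN clause with a StringBuilder replaces the quadratic churn of repeated string concatenation with a single linear pass. This sketch mirrors the question's method, with a guard added for the empty case:

```csharp
using System;
using System.Data;
using System.Text;

public static class WhereClauseBuilder
{
    // Builds "where filename in ('a', 'b', ...)" with a StringBuilder:
    // each Append is amortized O(1), instead of copying the whole string
    // on every += as in the original loop.
    public static string Build(DataTable table)
    {
        if (!table.Columns.Contains("Filename") || table.Rows.Count == 0)
            return string.Empty;

        var sb = new StringBuilder("where filename in (");
        for (int j = 0; j < table.Rows.Count; j++)
        {
            if (j > 0) sb.Append(", ");
            sb.Append('\'').Append(table.Rows[j]["Filename"]).Append('\'');
        }
        return sb.Append(')').ToString();
    }
}
```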
The following LINQ statement has a where clause on the first column and concatenates the third column into a variable:
string CSVValues = String.Join(",", dtOutput.AsEnumerable()
    .Where(a => a[0].ToString() == value)
    .Select(b => b[2].ToString()));
Step 1: run it through a profiler, and make sure you're looking at the right thing when optimizing.
Case in point: we had an issue we were sure was slow database interactions, and when we ran the profiler the DB barely showed up.
That said, possible things to try:
- If you have the memory available, convert the query to a list; this will force a full DB read. Otherwise the LINQ will probably load in chunks, doing multiple DB queries.
- Push the work to the DB: if you can create a query that trims down the data you are looking at, or even calculates the string for you, that might be faster.
- If the query is run often but the data rarely changes, consider copying the data to a local DB (e.g. SQLite) if you're using a remote DB.
- If you're using the local SQL Server, try SQLite; it's faster for many things.
var value = dataTable
    .AsEnumerable()
    .Select(row => row.Field<string>("columnName"));

var colValueStr = string.Join(",", value.ToArray());
Try adding a dummy column to your table with an expression. Something like this:
DataColumn dynColumn = new DataColumn();
dynColumn.ColumnName = "FullName";
dynColumn.DataType = System.Type.GetType("System.String");
dynColumn.Expression = "LastName + ' ' + 'ABC'";
UserDataSet.Tables[0].Columns.Add(dynColumn);
Later in your code you can use this dummy column instead. You don't need to run any loop to concatenate a string.
Try using a parallel for loop. Here's the sample code (note that += on a shared string is not thread-safe, so the concatenation needs a lock):
object sync = new object();
string str = string.Empty;
Parallel.ForEach(dataTable.AsEnumerable(),
    row => { lock (sync) { str += row["ColumnName"].ToString(); } });
I've separated the job into small pieces and let each piece be handled by its own thread. You can fine-tune the number of threads by varying the nthreads number. Try it with different values so you can see the difference in performance.
private string getListOfFileNames(DataTable listWithFileNames)
{
    string whereClause = String.Empty;
    if (listWithFileNames.Columns.Contains("Filename"))
    {
        int nthreads = 8; // You can play with this parameter to fine-tune and get your best time.
        int load = listWithFileNames.Rows.Count / nthreads; // This tells how many items each thread must process.
        List<ManualResetEvent> mres = new List<ManualResetEvent>(); // These let the method know when the work is done.
        List<StringBuilder> sbuilders = new List<StringBuilder>(); // These are used to concatenate each big string.
        for (int i = 0; i < nthreads; i++)
        {
            sbuilders.Add(new StringBuilder()); // Create a new string builder.
            mres.Add(new ManualResetEvent(false)); // Create a non-signaled ManualResetEvent.
            if (i == 0) // We know where to put the very beginning of the where clause.
            {
                sbuilders[0].Append("where filename in (");
            }
            // Calculate the last item to be processed by the current thread.
            int end = i == (nthreads - 1) ? listWithFileNames.Rows.Count : i * load + load;
            // Create a new thread to deal with a part of the big table.
            Thread t = new Thread(new ParameterizedThreadStart((x) =>
            {
                // This is the inside of the thread; we must unbox the parameters.
                object[] vars = x as object[];
                int lIndex = (int)vars[0];
                int uIndex = (int)vars[1];
                ManualResetEvent ev = vars[2] as ManualResetEvent;
                StringBuilder sb = vars[3] as StringBuilder;
                bool coma = false;
                // Concatenate the rows into the string builder.
                for (int j = lIndex; j < uIndex; j++)
                {
                    if (coma)
                    {
                        sb.Append(", ");
                    }
                    else
                    {
                        coma = true;
                    }
                    sb.Append("'").Append(listWithFileNames.Rows[j]["Filename"]).Append("'");
                }
                // Tell the parent thread that the job is done.
                ev.Set();
            }));
            // Start the thread with the calculated params.
            t.Start(new object[] { i * load, end, mres[i], sbuilders[i] });
        }
        // Wait for all child threads to finish their job.
        WaitHandle.WaitAll(mres.ToArray());
        // Concatenate the big string.
        for (int i = 1; i < nthreads; i++)
        {
            sbuilders[0].Append(", ").Append(sbuilders[i]);
        }
        sbuilders[0].Append(")"); // Close the where clause.
        // Return the finished where clause.
        return sbuilders[0].ToString();
    }
    // Returns empty.
    return whereClause;
}