NEST Search whole document C# Elasticsearch - c#

I want to make a query over a million documents in Elasticsearch using Nest. My code:
var response = client.Search<MyObject>(s => s
.Index("test")
.Type("one")
.Query(q => q.
Term(
t => t.name, "A"
)
)
.Size(10000)
.Scroll("10m")
.Pretty()
);
My MyObject class:
public class MyObject
{
public int id { get; set; }
public int age { get; set; }
public string lastname { get; set; }
public string name { get; set; }
}
The problem is that when the query finds no match in the first 10k documents, it does not continue through the rest of the results with the scroll API.
My question is how to achieve this, i.e. how do I move through all the pages of the scroll API even when a page has no hits?

The query will search all documents, but will only return you the top .Size number of documents.
You can paginate results using .From() and .Size(); however, deep pagination is likely to be a concern when paginating over a million documents. For this, you would be better off using the scroll API to retrieve 1 million documents efficiently. NEST has an observable helper, ScrollAll(), to help with this:
var client = new ElasticClient();
// number of slices in slice scroll
var numberOfSlices = 4;
var scrollObserver = client.ScrollAll<MyObject>("1m", numberOfSlices, s => s
.MaxDegreeOfParallelism(numberOfSlices)
.Search(search => search
.Index("test")
.Type("one")
.Query(q => q.Term(t => t.name, "A"))
)
).Wait(TimeSpan.FromMinutes(60), r =>
{
// do something with documents from a given response.
var documents = r.SearchResponse.Documents;
});
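If you would rather drive the scroll yourself than use the observable helper, a manual loop is another option. This is only a minimal sketch: it assumes a NEST version that still supports .Type() (as in the question) and a node at the default address, and the page size of 1000 and the Console.WriteLine body are placeholders of mine, not part of the original code.

```csharp
using System;
using System.Linq;
using Nest;

var client = new ElasticClient();

// The initial search opens the scroll context; Size is the page size per request.
var response = client.Search<MyObject>(s => s
    .Index("test")
    .Type("one")
    .Query(q => q.Term(t => t.name, "A"))
    .Size(1000)        // placeholder page size
    .Scroll("1m"));

// Keep requesting pages until one comes back empty.
while (response.IsValid && response.Documents.Any())
{
    foreach (var doc in response.Documents)
    {
        Console.WriteLine(doc.id); // placeholder for real processing
    }
    response = client.Scroll<MyObject>("1m", response.ScrollId);
}

// Release the server-side scroll context when finished.
client.ClearScroll(c => c.ScrollId(response.ScrollId));

public class MyObject
{
    public int id { get; set; }
    public int age { get; set; }
    public string lastname { get; set; }
    public string name { get; set; }
}
```

Each page contains only matching documents, so the loop ends when the matches run out rather than after the first page.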

Related

EF Core 5 check if all ids from filter exists in related entities

I have two models:
public class Employee
{
public int Id { get; set; }
public IList<Skill> Skills { get; set; }
}
public class Skill
{
public int Id { get; set; }
}
And I have filter with list of skill ids, that employee should contain:
public class Filter
{
public IList<int> SkillIds { get; set; }
}
I want to write query to get all employees, that have all skills from filter.
I tried:
query.Where(e => filter.SkillIds.All(id => e.Skills.Any(skill => skill.Id == id)));
And:
query = query.Where(e => e.Skills
.Select(x => x.Id)
.Intersect(filter.SkillIds)
.Count() == filter.SkillIds.Count);
But as a result I get an exception saying that the query could not be translated.
It is going to be a difficult, if not impossible task, to run a query like this on the sql server side.
This is because to make this work on the SQL side, you would be grouping each set of employee skills into a single row which would need to have a new column for every skill listed in the skills table.
SQL Server wasn't really made to handle grouping with an unknown set of columns passed into a query. Although this kind of query is technically possible, it's probably not very easy to do through an ORM like EF Core.
It would be easier to do this on the .NET side using something like:
var employees = _context.Employees.Include(x => x.Skills).ToList();
var filter = someFilter;
var result = employees.Where(emp => filter.SkillIds.All(skillId => emp.Skills.Any(skill => skill.Id == skillId))).ToList();
This solution works:
foreach (int skillId in filter.SkillIds)
{
query = query.Where(e => e.Skills.Any(skill => skill.Id == skillId));
}
I am not sure about its performance, but it works pretty fast with a small amount of data.
I've also encountered this issue several times now. This is the query I've come up with; I found it works best and does not result in an exception.
query.Where(e => e.Skills.Where(s => filter.SkillIds.Contains(s.Id)).Count() == filter.SkillIds.Count);
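One thing to watch with the Count-based comparison is that it only holds if the ids in the filter are distinct. Here is a small in-memory sketch (LINQ to Objects, not EF Core) of the same idea with the ids de-duplicated first; SkillFilter and WithAllSkills are names I made up for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Skill
{
    public int Id { get; set; }
}

public class Employee
{
    public int Id { get; set; }
    public IList<Skill> Skills { get; set; } = new List<Skill>();
}

public static class SkillFilter
{
    // Returns the employees that hold every skill id in skillIds.
    public static List<Employee> WithAllSkills(IEnumerable<Employee> employees, IEnumerable<int> skillIds)
    {
        // De-duplicate so a repeated id in the filter does not inflate the expected count.
        var wanted = skillIds.Distinct().ToList();

        return employees
            .Where(e => e.Skills
                .Select(s => s.Id)
                .Distinct()
                .Count(id => wanted.Contains(id)) == wanted.Count)
            .ToList();
    }
}
```

Calling SkillFilter.WithAllSkills(employees, new[] { 1, 2, 2 }) treats the filter as { 1, 2 }, so an employee with skills 1, 2 and 3 matches while an employee with only skill 1 does not.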

How to Fetch a Lot of Records with EF6

I need to fetch a lot of records from a SQL Server database with EF6. The problem is that it takes a lot of time. The main problem is an entity called Series, which contains Measurements. There are about 250K of them, and each has 2 nested entities called FrontDropPhoto and SideDropPhoto.
[Table("Series")]
public class DbSeries
{
[Key] public Guid SeriesId { get; set; }
public List<DbMeasurement> MeasurementsSeries { get; set; }
}
[Table("Measurements")]
public class DbMeasurement
{
[Key] public Guid MeasurementId { get; set; }
public Guid CurrentSeriesId { get; set; }
public DbSeries CurrentSeries { get; set; }
public Guid? SideDropPhotoId { get; set; }
[ForeignKey("SideDropPhotoId")]
public virtual DbDropPhoto SideDropPhoto { get; set; }
public Guid? FrontDropPhotoId { get; set; }
[ForeignKey("FrontDropPhotoId")]
public virtual DbDropPhoto FrontDropPhoto { get; set; }
}
[Table("DropPhotos")]
public class DbDropPhoto
{
[Key] public Guid PhotoId { get; set; }
}
I wrote the fetch methods like this (most of the properties are omitted for clarity):
public async Task<List<DbSeries>> GetSeriesByUserId(Guid dbUserId)
{
using (var context = new DDropContext())
{
try
{
var loadedSeries = await context.Series
.Where(x => x.CurrentUserId == dbUserId)
.Select(x => new
{
x.SeriesId,
}).ToListAsync();
var dbSeries = new List<DbSeries>();
foreach (var series in loadedSeries)
{
var seriesToAdd = new DbSeries
{
SeriesId = series.SeriesId,
};
seriesToAdd.MeasurementsSeries = await GetMeasurements(seriesToAdd);
dbSeries.Add(seriesToAdd);
}
return dbSeries;
}
catch (SqlException e)
{
throw new TimeoutException(e.Message, e);
}
}
}
public async Task<List<DbMeasurement>> GetMeasurements(DbSeries series)
{
using (var context = new DDropContext())
{
var measurementForSeries = await context.Measurements.Where(x => x.CurrentSeriesId == series.SeriesId)
.Select(x => new
{
x.CurrentSeries,
x.CurrentSeriesId,
x.MeasurementId,
x.FrontDropPhotoId,
x.SideDropPhotoId,
})
.ToListAsync();
var dbMeasurementsForAdd = new List<DbMeasurement>();
foreach (var measurement in measurementForSeries)
{
var measurementToAdd = new DbMeasurement
{
CurrentSeries = series,
MeasurementId = measurement.MeasurementId,
FrontDropPhotoId = measurement.FrontDropPhotoId,
FrontDropPhoto = measurement.FrontDropPhotoId.HasValue
? await GetDbDropPhotoById(measurement.FrontDropPhotoId.Value)
: null,
SideDropPhotoId = measurement.SideDropPhotoId,
SideDropPhoto = measurement.SideDropPhotoId.HasValue
? await GetDbDropPhotoById(measurement.SideDropPhotoId.Value)
: null,
};
dbMeasurementsForAdd.Add(measurementToAdd);
}
return dbMeasurementsForAdd;
}
}
private async Task<DbDropPhoto> GetDbDropPhotoById(Guid photoId)
{
using (var context = new DDropContext())
{
var dropPhoto = await context.DropPhotos
.Where(x => x.PhotoId == photoId)
.Select(x => new
{
x.PhotoId,
}).FirstOrDefaultAsync();
if (dropPhoto == null)
{
return null;
}
var dbDropPhoto = new DbDropPhoto
{
PhotoId = dropPhoto.PhotoId,
};
return dbDropPhoto;
}
}
Relationships configured via FluentAPI:
modelBuilder.Entity<DbSeries>()
.HasMany(s => s.MeasurementsSeries)
.WithRequired(g => g.CurrentSeries)
.HasForeignKey(s => s.CurrentSeriesId)
.WillCascadeOnDelete();
modelBuilder.Entity<DbMeasurement>()
.HasOptional(c => c.FrontDropPhoto)
.WithMany()
.HasForeignKey(s => s.FrontDropPhotoId);
modelBuilder.Entity<DbMeasurement>()
.HasOptional(c => c.SideDropPhoto)
.WithMany()
.HasForeignKey(s => s.SideDropPhotoId);
I need all of this data to populate a WPF DataGrid. The obvious solution is to add paging to this DataGrid. This solution is tempting, but it will break the logic of my application badly. I want to create plots at runtime using this data, so I need all of it, not just some parts. I've tried to optimize it a bit by making every method use async/await, but it wasn't helpful enough. I've tried to add
.Configuration.AutoDetectChangesEnabled = false;
for each context, but loading time is still really long. How to approach this problem?
Other than the very large amount of data that you are intent on returning, the main problem is that the way your code is structured means that for each of the 250,000 Series you are performing another trip to the database to get the Measurements for the Series and a further 2 trips to get the front/side DropPhotos for each Measurement. Apart from the round-trip time for the 750,000 calls this completely avoids taking advantage of SQL's set-based performance optimisations.
Try to ensure that EF submits as few queries as possible to return your data, preferably one:
var loadedSeries = await context.Series
.Where(x => x.CurrentUserId == dbUserId)
.Select(x => new DbSeries
{
SeriesId = x.SeriesId,
MeasurementsSeries = x.MeasurementsSeries.Select(ms => new DbMeasurement
{
MeasurementId = ms.MeasurementId,
FrontDropPhotoId = ms.FrontDropPhotoId,
FrontDropPhoto = new DbDropPhoto
{
PhotoId = ms.FrontDropPhotoId
},
SideDropPhotoId = ms.SideDropPhotoId,
SideDropPhoto = new DbDropPhoto
{
PhotoId = ms.SideDropPhotoId
},
})
}).ToListAsync();
Firstly, async/await will not help you here. It isn't a "go faster" type of operation, it is about accommodating systems that "can be doing something else while this operation is computing". If anything, it makes an operation slower in exchange for making a system more responsive.
My recommendation would be to separate your concerns: On the one hand you want to display detailed data. On the other hand you want to plot an overall graph. Separate these. A user doesn't need to see details for every record at one time, paginating it server-side will greatly reduce the raw amount of data at any one time. Graphs want to see all data, but they don't care about "heavy" details like bitmaps.
The next thing would be to separate your view's model from your domain model (entity). Doing stuff like:
var measurementToAdd = new DbMeasurement
{
CurrentSeries = series,
MeasurementId = measurement.MeasurementId,
FrontDropPhotoId = measurement.FrontDropPhotoId,
FrontDropPhoto = measurement.FrontDropPhotoId.HasValue
? await GetDbDropPhotoById(measurement.FrontDropPhotoId.Value)
: null,
SideDropPhotoId = measurement.SideDropPhotoId,
SideDropPhoto = measurement.SideDropPhotoId.HasValue
? await GetDbDropPhotoById(measurement.SideDropPhotoId.Value)
: null,
};
... is just asking for trouble. Any code that accepts a DbMeasurement should receive a complete, or completable DbMeasurement, not a partially populated entity. It will burn you in the future. Define a view model for the data grid and populate it. This way you clearly differentiate what is an entity model and what is the view's model.
Next, for the data grid, absolutely implement server-side pagination:
public ICollection<MeasurementViewModel> GetMeasurements(Guid seriesId, int pageNumber, int pageSize)
{
using (var context = new DDropContext())
{
var measurementsForSeries = context.Measurements
.Where(x => x.CurrentSeriesId == seriesId)
// Skip/Take need a stable order; EF6 rejects Skip on unsorted input.
.OrderBy(x => x.MeasurementId)
.Select(x => new MeasurementViewModel
{
MeasurementId = x.MeasurementId,
FrontDropPhoto = x.FrontDropPhoto.ImageData,
SideDropPhoto = x.SideDropPhoto.ImageData
})
.Skip(pageNumber * pageSize)
.Take(pageSize)
.ToList();
return measurementsForSeries;
}
}
This assumes that we want to pull image data for the rows if available. Leverage the navigation properties for related data in the query rather than iterating over results and going back to the database for each and every row.
For the graph plot you can return either the raw integer data or a data structure for just the fields needed rather than relying on the data returned for the grid. It can be pulled for the entire table without having the "heavy" image data. It may seem counter-productive to go to the database when the data might already be loaded once already, but the result is two highly efficient queries rather than one very inefficient query trying to serve two purposes.
Why are you reinventing the wheel and manually loading and constructing your related entities? You’re causing an N+1 selects problem resulting in abhorrent performance. Let EF query for related entities efficiently via .Include
Example:
var results = context.Series
.AsNoTracking()
.Include( s => s.MeasurementsSeries.Select( ms => ms.FrontDropPhoto ) )
.Include( s => s.MeasurementsSeries.Select( ms => ms.SideDropPhoto ) )
.Where( ... )
.ToList(); // should use async
This will speed up execution dramatically, though it may still not be quick enough for your requirements if it needs to construct hundreds of thousands to millions of objects, in which case you can retrieve the data in concurrent batches.

How do I group on one of two possible fields using LINQ?

I am trying to get the latest contact with a given user, grouped by user:
public class ChatMessage
{
public string SentTo { get; set; }
public string SentFrom { get; set; }
public string MessageBody { get; set; }
public string SendDate { get; set; }
}
The user's contact info could either be in SentTo or SentFrom.
List<ChatMessage> ecml = new List<ChatMessage>();
var q = ecml.OrderByDescending(m => m.SendDate).First();
would give me the latest message, but I need the last message per user.
The closest solution I could find was LINQ Max Date with group by, but I cant seem to figure out the correct syntax. I would rather not create multiple List objects if I don't have to.
If the user's info is in SentTo, my info will be in SentFrom, and vice-versa, so I do have some way of checking where the user's data is.
Did I mention I was very new to LINQ? Any help would be greatly appreciated.
Since you need to interpret each record twice - i.e. as a SentTo and a SentFrom, the query becomes a bit tricky:
var res = ecml
.SelectMany(m => new[] {
new { User = m.SentFrom, m.SendDate }
, new { User = m.SentTo, m.SendDate }
})
.GroupBy(p => p.User)
.Select(g => new {
User = g.Key
, Last = g.OrderByDescending(m => m.SendDate).First()
});
The key trick is in SelectMany, which makes each ChatMessage item into two anonymous items - one that pairs up the SentFrom user with SendDate, and one that pairs up the SentTo user with the same date.
Once you have both records in an enumerable, the rest is straightforward: you group by the user, and then apply the query from your post to each group.
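If you already know your own user name, a variant that skips SelectMany is to group each message by whichever side is not you. This is a sketch of that alternative rather than the query above; ChatQueries, LastMessagePerContact and the me parameter are names I introduced, and it assumes SendDate holds ISO 8601 text so that string ordering matches chronological ordering.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class ChatMessage
{
    public string SentTo { get; set; }
    public string SentFrom { get; set; }
    public string MessageBody { get; set; }
    public string SendDate { get; set; }
}

public static class ChatQueries
{
    // For each contact of `me`, returns the most recent message exchanged with them.
    public static Dictionary<string, ChatMessage> LastMessagePerContact(
        IEnumerable<ChatMessage> messages, string me)
    {
        return messages
            .Where(m => m.SentFrom == me || m.SentTo == me)
            // The contact is whichever side of the message is not `me`.
            .GroupBy(m => m.SentFrom == me ? m.SentTo : m.SentFrom)
            .ToDictionary(
                g => g.Key,
                // Ordinal comparison is chronological only for ISO 8601 strings (assumption).
                g => g.OrderByDescending(m => m.SendDate, StringComparer.Ordinal).First());
    }
}
```

Unlike the anonymous-type version, this keeps the whole ChatMessage, so the body of the latest message is available per contact.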
It should be pretty easy, look at this code:
string username = "John";
var q = ecml.Where(i=>i.SentFrom == username || i.SentTo == username).OrderByDescending(m => m.SendDate).First();
It simply filters your collection by choosing the items whose SentFrom or SentTo equals username.

Windows Azure Mobile Service query table

I use Windows Azure Mobile Service.
I have a table of Element.
I want to query the cloud database :
select Id, Name
FROM Element ORDER BY creationTime
But I don't understand at all the "query" system with Windows Azure Mobile Service.
I have a IMobileServiceTable but don't know what to do with that...
I checked the tutorial, and it explains how to use a Where clause, but not a select. I need to select only some columns, because my elements have a picture and I don't want to download it in my getAll method.
Edit:
I tried this:
Task.Factory.StartNew(() =>
{
var query = table.Select(x =>
new Element()
{
Id = x.Id,
Name = x.Name,
Price = x.Price
});
var _items = query.ToListAsync().Result;
}).ContinueWith((x) => handleProductsArrived(x.Result));
But it doesn't work.
You can find a helpful post from Carlos that includes what the corresponding SQL query would be here: http://blogs.msdn.com/b/carlosfigueira/archive/2012/09/21/playing-with-the-query-object-in-read-operations-on-azure-mobile-services.aspx
For example:
function read(query, user, request) {
query.where({ UserId: user.userId })
.select('id', 'MovieName', 'MovieRating')
.orderBy('MovieName')
.take(10);
request.execute();
}
would translate to
SELECT TOP 10 [id], [MovieName], [MovieRating]
FROM MovieRating
WHERE UserId = ?
ORDER BY MovieName
So for your case where you need to translate
SELECT Id, Name
FROM Element
ORDER BY creationTime
you'd go with something like the following:
function read(query, user, request) {
query.select('id', 'Name')
.orderBy('creationTime');
request.execute();
}
It sounds like you are just looking to do a simple query with IMobileServiceTable
SELECT Id, Name FROM Element ORDER BY creationTime
If you do not mind using the IMobileServiceTable<TodoItem>, you can try:
1) Removing the member properties you do not need from your Object
Example:
public class TodoItem
{
public int Id { get; set; }
// REMOVE WHAT YOU DO NOT WANT
//[DataMember(Name = "text")]
//public string Text { get; set; }
[DataMember(Name = "complete")]
public bool Complete { get; set; }
}
2) Here's the code to read the data:
private void RefreshTodoItems()
{
items = todoTable
.OrderBy( todoItem => todoItem.Id )
.Take(10)
.ToCollectionView();
ListItems.ItemsSource = items;
}
which is basically:
SELECT TOP 10 Id, Complete FROM TodoTable ORDER BY Id
The code example for todoTable is at http://www.windowsazure.com/en-us/develop/mobile/tutorials/get-started-wp8/
Hope this helps.
If you're using .net, you pretty much follow linq.
Looking at the sample app - where it has -
private void RefreshTodoItems()
{
// This code refreshes the entries in the list view by querying the TodoItems table.
// The query excludes completed TodoItems
items = todoTable
.Where(todoItem => todoItem.Complete == false)
.ToCollectionView();
ListItems.ItemsSource = items;
}
If, for example, you did not want to return the Complete flag you could add before the call to .ToCollectionView()
.Select(item=>new {item.Id, item.Text})
Which would create a list of a new object of anonymous type (can be a concrete type) with the two members specified.

RavenDB index for nested query

I'm pretty new to RavenDB and am struggling to find a solution to the following:
I have a collection called ServiceCalls that look like this:
public class ServiceCall
{
public int ID { get; set; }
public string IncidentNumber { get; set; }
public string Category { get; set; }
public string SubCategory { get; set; }
public DateTime ReportedDateTime { get; set; }
public string Block { get; set; }
public decimal Latitude { get; set; }
public decimal Longitude { get; set; }
}
I have an index named ServiceCalls/CallsByCategory that looks like this:
Map = docs => from doc in docs
select new
{
Category = doc.Category,
CategoryCount = 1,
ServiceCalls = doc,
};
Reduce = results => from result in results
group result by result.Category into g
select new
{
Category = g.Key,
CategoryCount = g.Count(),
ServiceCalls = g.Select(i => i.ServiceCalls)
};
So the output is:
public class ServiceCallsByCategory
{
public string Category { get; set; }
public int CategoryCount { get; set; }
public IEnumerable<ServiceCall> ServiceCalls { get; set; }
}
using this query everything works as it should
var q = from i in session.Query<ServiceCallsByCategory>("ServiceCalls/CallsByCategory") select i
Where I am absolutely lost is writing an index that would allow me to query by ReportedDateTime. Something that would allow me to do this:
var q = from i in session.Query<ServiceCallsByCategory>("ServiceCalls/CallsByCategory")
where i.ServiceCalls.Any(x=>x.ReportedDateTime >= new DateTime(2012,10,1))
select i
Any guidance would be MUCH appreciated.
A few things,
You can't have a .Count() method in your reduce clause. If you look closely, you will find your counts are wrong. As of build 2151, this will actually throw an exception. Instead, you want CategoryCount = g.Sum(x => x.CategoryCount)
You always want the structure of the map to match the structure of the reduce. If you're going to build a list of things, then you should map a single element array of each thing, and use .SelectMany() in the reduce step. The way you have it now only works due to a quirk that will probably be fixed at some point.
By building the result as a list of ServiceCalls, you are copying the entire document into the index storage. Not only is that inefficient, but it's unnecessary. You would do better keeping a list of just the ids. Raven has an .Include() method that you can use if you need to retrieve the full document. The main advantage here is that you are guaranteed to have the most current data for each item you get back, even if your index results are still stale.
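To see why .Count() goes wrong in a reduce, keep in mind that the reduce can be run again over its own partial output. The following in-memory sketch (plain LINQ, no RavenDB involved) simulates that; ReduceDemo, TwoPass and the batch size are hypothetical names and values chosen for the illustration, not how Raven batches internally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class CategoryResult
{
    public string Category { get; set; }
    public int CategoryCount { get; set; }
}

public static class ReduceDemo
{
    // Reduce written with Count(): counts rows in the group and ignores their stored counts.
    public static List<CategoryResult> ReduceWithCount(IEnumerable<CategoryResult> results) =>
        results.GroupBy(r => r.Category)
               .Select(g => new CategoryResult { Category = g.Key, CategoryCount = g.Count() })
               .ToList();

    // Reduce written with Sum(): adds up the counts carried by each row.
    public static List<CategoryResult> ReduceWithSum(IEnumerable<CategoryResult> results) =>
        results.GroupBy(r => r.Category)
               .Select(g => new CategoryResult { Category = g.Key, CategoryCount = g.Sum(x => x.CategoryCount) })
               .ToList();

    // Maps `documents` entries, reduces them in batches, then re-reduces the partial results.
    public static int TwoPass(
        Func<IEnumerable<CategoryResult>, List<CategoryResult>> reduce, int documents, int batchSize)
    {
        var mapped = Enumerable.Range(0, documents)
            .Select(_ => new CategoryResult { Category = "Fire", CategoryCount = 1 })
            .ToList();

        var partials = mapped
            .Select((r, i) => new { r, i })
            .GroupBy(x => x.i / batchSize)
            .SelectMany(b => reduce(b.Select(x => x.r)))
            .ToList();

        return reduce(partials).Single().CategoryCount;
    }
}
```

With 5 documents and a batch size of 2, the Count version ends up counting the 3 partial results instead of the 5 documents, while the Sum version still arrives at 5.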
Putting all three together, the correct index would be:
public class ServiceCallsByCategory
{
public string Category { get; set; }
public int CategoryCount { get; set; }
public int[] ServiceCallIds { get; set; }
}
public class ServiceCalls_CallsByCategory : AbstractIndexCreationTask<ServiceCall, ServiceCallsByCategory>
{
public ServiceCalls_CallsByCategory()
{
Map = docs => from doc in docs
select new {
Category = doc.Category,
CategoryCount = 1,
ServiceCallIds = new[] { doc.ID },
};
Reduce = results => from result in results
group result by result.Category
into g
select new {
Category = g.Key,
CategoryCount = g.Sum(x => x.CategoryCount),
ServiceCallIds = g.SelectMany(i => i.ServiceCallIds)
};
}
}
Querying it with includes, would look like this:
var q = session.Query<ServiceCallsByCategory, ServiceCalls_CallsByCategory>()
.Include<ServiceCallsByCategory, ServiceCall>(x => x.ServiceCallIds);
When you need a document, you still load it with session.Load<ServiceCall>(id) but Raven will not have to make a round trip back to the server to get it.
NOW - that doesn't address your question about how to filter the results by date. For that, you really need to think about what you are trying to accomplish. All of the above would assume that you really want every service call shown for each category at once. Most of the time, that's not going to be practical because you want to paginate results. You probably DON'T want to even use what I've described above. I am making some grand assumptions here, but most of the time one would filter by category, not group by it.
Let's say you had an index that just counts the categories (the above index without the list of service calls). You might use that to display an overview screen. But you wouldn't be interested in the documents that were in each category until you clicked one and drilled into a details screen. At that point, you know which category you're in, and you can filter by it and reduce to a date range without a static index:
var q = session.Query<ServiceCall>().Where(x=> x.Category == category && x.ReportedDateTime >= datetime)
If I am wrong and you really DO need to show all documents from all categories, grouped by category, and filtered by date, then you are going to have to adopt an advanced technique like the one I described in this other StackOverflow answer. If this is really what you need, let me know in comments and I'll see if I can write it for you. You will need Raven 2.0 to make it work.
Also - be very careful about what you are storing for ReportedDateTime. If you are going to be doing any comparisons at all, you need to understand the difference between calendar time and instantaneous time. Calendar time has quirks like daylight savings transitions, time zone differences, and more. Instantaneous time tracks the moment something happened, regardless of who's asking. You probably want instantaneous time for your usage, which means either using a UTC DateTime, or switching to DateTimeOffset which will let you represent instantaneous time without losing the local contextual value.
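As a quick illustration of the calendar-versus-instantaneous distinction, here is a small sketch using DateTimeOffset. The dates and the UTC-06:00 offset are arbitrary values I picked, and InstantDemo is a hypothetical name.

```csharp
using System;

public static class InstantDemo
{
    // Midnight at UTC-06:00 and 06:00 UTC describe the same moment.
    public static readonly DateTimeOffset Local =
        new DateTimeOffset(2012, 10, 1, 0, 0, 0, TimeSpan.FromHours(-6));
    public static readonly DateTimeOffset Utc =
        new DateTimeOffset(2012, 10, 1, 6, 0, 0, TimeSpan.Zero);

    // DateTimeOffset equality compares the underlying UTC instant.
    public static bool SameInstant => Local == Utc;

    // The clock readings differ, so comparing the local DateTime parts says "not equal".
    public static bool SameClockReading => Local.DateTime == Utc.DateTime;

    // EqualsExact also requires the offsets to match.
    public static bool SameInstantAndOffset => Local.EqualsExact(Utc);
}
```

Here SameInstant is true, while SameClockReading and SameInstantAndOffset are both false, which is the kind of difference that can silently change the result of a >= filter on ReportedDateTime.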
Update
I experimented with trying to build an index that would use that technique I described to let you have all results in your category groups but still filter by date. Unfortunately, it's just not possible. You would have to have all ServiceCalls grouped together in the original document and express it in the Map. It doesn't work the same way at all if you have to Reduce first. So you really should just consider simple query for ServiceCalls once you are in a specific Category.
Could you add ReportedDateTime to the Map and aggregate it in the Reduce? If you only care about the max per category, something like this should be sufficient.
Map = docs => from doc in docs
select new
{
Category = doc.Category,
CategoryCount = 1,
ServiceCalls = doc,
ReportedDateTime = doc.ReportedDateTime
};
Reduce = results => from result in results
group result by result.Category into g
select new
{
Category = g.Key,
CategoryCount = g.Sum(x => x.CategoryCount),
ServiceCalls = g.Select(i => i.ServiceCalls),
ReportedDateTime = g.Max(rdt => rdt.ReportedDateTime)
};
You could then query it just based on the aggregated ReportedDateTime:
var q = from i in session.Query<ServiceCallsByCategory>("ServiceCalls/CallsByCategory")
where i.ReportedDateTime >= new DateTime(2012,10,1)
select i
