Is there any way in which I can retrieve all the results from client.Search (I think this can be done using scroll API) and create a pagination for these results when displaying them? Is there any API from ES for doing so?
Or using From() and size(), can it be done?
For eg: Lets say, I have 100,000 documents on the index and when I search for a keyword it generates some 200 results. How can I use scroll, from and size to show them?
TIA
We use from and size options to implement pagination for Elasticsearch results.
The code snippet can be something like below:
def query(page)
size = 10
page ||= 1
from = (page-1) * size
# elasticsearch query with from * size options
end
You may need to know total number of results to implement pagination without sending a additional count request. To get the total results, you can use the total field of the response.
=== Updated
If you want to get the search results of the first page, then you can use query(1). If you want to get the search results of the second page, then you can use query(2) and so on.
The purpose of the scroll is slightly different. Let's say you need to get all records of the search results and the number of results are too large (eg., millions of results). If you retrieve all the data at once, it will arise a kind of memory issue or problems because of the high load. In this case, you can use scroll to fetch results step by step.
For the pagination, you don't need to get all data of the search results. You only need to get some data of a specific page. In this case, you may need to use just query with from and size options NOT scroll.
Related
I have a question about Pageable< T > in C#. I have table storage in Azure named- Domains. I am using Azure.Data.Tables nuget package, and to Query all the domains i am using this :
var domains = _localDomainTableClient
.Query<Domain>()
.AsPages()
.SelectMany(d => d.Values);
But i dont understant something. What if i use Query< T > without AsPages method ?
IEnumerable<Domain> domainsPAges = tableClient.Query<Domain>();
I know that AsPages() returns a collection of pages. For example if I have 10000 items in the table, Query<Domain>().AsPages() should make 10 requests to the table and return me 10 pages with 1000 items in each page (unless I changed the default value) but I don't understand what exactly is happening if I don't use AsPages() ?
Example:
IEnumerable<Domain> domainsPAges = tableClient.Query<Domain>();
Query<Domain>() return Pageble< T > but, does it make 10 requests to the table again or does it take all the elements until the memory overflows (4 MB by default) or take all the elements at once ?
I check documentation, but I couldn't find what I needed.
A collection of values that may take multiple service requests to iterate over.
A collection of values retrieved in pages
What is it even mean ?
Thanks for help.
How many requests Query will make, if AsPages() is not used?
In C#, the Query method is part of the Azure Cosmos DB SDK.
The Query<T> method will retrieve all matching results in a single request to the database. This behaviour is known as a single-partition query.
If the query involves multiple partitions, then the Query method will make multiple requests to retrieve all matching results. The exact number of requests will depend on the number of partitions involved and the size of the result set.
If the AsPages() method is not used, then the query results will be returned as a single enumerable object. If the AsPages() method is used, the query results will be returned as a sequence of pages, each containing a subset of the query results. In this case, the number of requests made will depend on the size of each page and the total number of results matching the query criteria.
What if i use Query< T > without AsPages method
In C#, if you use Query<T> without the AsPages method, you will receive all the results of the query in a single batch.
The Query<T> method returns an IEnumerable that represents the results of the query. If you do not use AsPages, the entire result set will be loaded into memory at once. This leads to high memory usage and performance can be degraded, as it can consume a lot of memory.
AsPages method also enables you to retrieve the results in a more efficient manner. It allows you to retrieve the results in smaller batches or pages, which can help reduce the memory usage and improve performance. The AsPages method returns an IAsyncEnumerable<Page<T>>, where each Page represents a page of results.
Query() return Pageble< T > but, does it make 10 requests to the table again or does it take all the elements until the memory overflows (4 MB by default) or take all the elements at once?
The behaviour of Query() depends on the implementation of the method and the underlying database system.
when you execute a query that returns a pageable result, the database will execute the query and return only a subset of the results to the application, based on the page size and the current page number. The remaining results are retrieved on subsequent requests for the next pages.
So if you call Query<Domain>() with a specific page size, it will only retrieve the number of elements specified in that page size, and not all the elements in the table.
The number of requests made to the table will depend on the number of pages that need to be retrieved to return all the results for the query.
The memory used to store the retrieved results will depend on the size of the objects retrieved and the number of objects retrieved in each page. When using pageable results, the size of the objects returned is generally limited to reduce memory usage, so it's unlikely that the memory usage will exceed the default limit of 4 MB.
For more information refer this Query table entities and SO Thread.
I have a web application in which I get data from my database and show in a datatable. I am facing an issue doing this as the data that I am fetching has too many rows(200 000). So when I query something like select * from table_name;
my application gets stuck.
Is there a way to handle this problem with JavaScript?
I tried pagination but I cannot figure how would i do that as datatable creates pagination for already rendered data?
Is there a way through which I can run my query through pagination at
the backend?
I have come across the same problem when working with mongodb and angularjs. I used server side paging. Since you have huge number of records, You can try using the same approach.
Assuming a case that you are displaying 25 records in one page.
Backend:
Get the total count of the records using COUNT query.
select * from table_name LIMIT 25 OFFSET
${req.query.pageNumber*25} to query limited records based on the page number;
Frontend:
Instead of using datatable, display the data in HTML table it self.
Define buttons for next page and previous page.
Define global variable in the controller/js file for pageNumber.
Increment pageNumber by 1 when next page button is clicked and
decrement that by 1 when prev button is pressed.
use result from COUNT query to put upper limit to pageNumber
variable.(if 200 records are there limit will be 200/25=8).
So basically select * from table_name LIMIT 25 OFFSET
${req.query.pageNumber*25} will limit the number of records to 25. when req.query.pageNumber=1, it will offset first 25records and sends next 25 records. similarly if req.query.pageNumber=2, it will offset first 2*25 records and sends 51-75 records.
There are two ways to handle.
First way - Handling paging in client side
Get all data from database and apply custom paging.
Second way - Handling paging in server side
Every time you want to call in database and get records according to pagesize.
You can use LIMIT and OFFSET constraints for pagination in MySQL. I understand that at a time 2 lacs data makes performance slower. But as you mention that you have to use JS for that. So make it clear that if you wants js as frontend then it is not going to help you. But as you mention that you have a web application, If that application is on Node(as server) then I can suggest you the way, which can help you a lot.
use 2 variables, named var_pageNo and var_limit. Now use the row query of mysql as
select * form <tbl_name> LIMIT var_limit OFFSET (var_pageNo * var_limit);
Do code according to this query. Replace the variable with your desire values. This will make your performance faster, and will fetch the data as per your specified limit.
hope this will helpful.
Problem summary:
C# (MVC), entity framework 5.0 and Oracle.
I have a couple of million rows in a view which joins two tables.
I need to populate dropdownlists with filter-posibilities.
The options in these dropdownlists should reflect the actual contents
of the view for that column, distinct.
I want to update the dropdownlists whenever you select something, so
that the new options reflect the filtered content, preventing you
from choosing something that would give 0 results.
Its slow.
Question: whats the right way of getting these dropdownlists populated?
Now for more detail.
-- Goal of the page --
The user is presented with some dropownlists that filter the data in a grid below. The grid represents a view (see "Database") where the results are filtered.
Each dropdownlist represents a filter for a column of the view. Once something is selected, the rest of the page updates. The other dropdownlists now contain the posible values for their corresponding columns that complies to the filter that was just applied in the first dropdownlist.
Once the user has selected a couple of filters, he/she presses the search button and the grid below the dropdownlists updates.
-- Database --
I have a view that selects almost all columns from two tables, nothing fancy there. Like this:
SELECT tbl1.blabla, tbl2.blabla etc etc
FROM table1 tbl1, table2 tbl2
WHERE bsl.bvz_id = bvz.id AND bsl.einddatum IS NULL;
There is a total of 22 columns. 13 VARCHARS (mostly small, 1 - 20, one of em has a size of 2000!), 6 DATES and 3 NUMBERS (one of them size 38 and one of them 15,2).
There are a couple of indexes on the tables, among which the relevant ID's for the WHERE clause.
Important thing to know: I cannot change the database. Maybe set an index here and there, but nothing major.
-- Entity Framework --
I created a Database first EDMX in my solution and also mapped the view. There are also classes for both tables, but I need data from both of them, so I don't know if I need them. The problem by selecting things from either table would be that you can't apply half of the filtering, but maybe there are smart way's I didn't think of yet.
-- View --
My view is strongly bound to a viewModel. In there I have a IEnumerable for each dropdownlist. The getter for these gets its data from a single IEnumerable called NameOfViewObjects. Like this:
public string SelectedColumn1{ get; set; }
private IEnumerable<SelectListItem> column1Options;
public IEnumerable<SelectListItem> Column1Options
{
get
{
if (column1Options == null)
{
column1Options= NameOfViewObjects.Select(item => item.Column1).Distinct()
.Select(item => new SelectListItem
{
Value = item,
Text = item,
Selected = item.Equals(SelectedColumn1, StringComparison.InvariantCultureIgnoreCase)
});
}
return column1Options;
}
}
The two solutions I've tried are:
- 1 -
Selecting all columns in a linq query I need for the dropdownlists (the 2000 varchar is not one of them and there are only 2 date columns), do a distinct on them and put the results into a Hashset. Then I set NameOfViewObjects to point towards this hashset. I have to wait for about 2 minutes for that to complete, but after that, populating the dropdownlists is almost instant (maybe a second for each of them).
model.Beslissingen = new HashSet<NameOfViewObject>(dbBes.NameOfViewObject
.DistinctBy(item => new
{
item.VarcharColumn1,
item.DateColumn1,
item.DateColumn2,
item.VarcharColumn2,
item.VarcharColumn3,
item.VarcharColumn4,
item.VarcharColumn5,
item.VarcharColumn6,
item.VarcharColumn7,
item.VarcharColumn8
}
)
);
The big problem here is that the object NameOfViewObject is probably quite large, and even though using distinct here, resulting in less than 100.000 results, it still uses over 500mb of memory for it. This is unacceptable, because there will be a lot of users using this screen (a lot would be... 10 max, 5 average simultaniously).
- 2 -
The other solution is to use the same linq query and point NameOfViewObjects towards the IQueryable it produces. This means that every time the view wants to bind a dropdownlist to a IEnumerable, it will fire a query that will find the distinct values for that column in a table with millions of rows where most likely the column it's getting the values from is not indexed. This takes around 1 minute for each dropdownlist (I have 10), so that takes ages.
Don't forget: I need to update the dropdownlists every time one of them has it's selection changed.
-- Question --
So I'm probably going at this the wrong way, or maybe one of these solutions should be combined with indexing all of the columns I use, maybe I should use another way to store the data in memory, so it's only a little, but there must be someone out there who has done this before and figured out something smart. Can you please tell me what would be the best way to handle a situation like this?
Acceptable performance:
having to wait for a while (2 minutes) while the page loads, but
everything is fast after that.
having to wait for a couple of seconds every time a dropdownlist
changes
the page does not use more than 500mb of memory
Of course you should have indexes on all columns and combinations in WHERE clauses. No index means table scan and O(N) query times. Those cannot scale under any circumstance.
You do not need millions of entries in a drop down. You need to be smarter about filtering the database down to manageable numbers of entries.
I'd take a page from Google. Their type ahead helps narrow down the entire Internet graph into groups of 25 or 50 per page, with the most likely at the top. Maybe you could manage that, too.
Perhaps a better answer is something like a search engine. If you were a Java developer you might try Lucene/SOLR and indexing. I don't know what the .NET equivalent is.
First point you need to check is your DB, make sure you have to right indexes and entity relations in place,
next if you want to dynamical build your filter options then you need to run the query with the existing filters to obtain what the next filter can be. there are several ways to do this,
firstly you can query the data and extract the values from the return, this has a huge load time and wastes time returning data you don't want (unless you are live updating the results with the filter and dont have paging, in which case you might aswell just get all the data and use linqToObjects to filter)
a second option is to have a parallel queries for each filter that returns the possible filters, so filter A = all possible values of A from data, filter b = all possible values of B when filtered by A in the data, C = all possible values of C when filtered by A & B in the data, etc. this is better than the first but not by much
another option is the use aggregates to speed things up, ie you have a parallel query as above but instead of returning the data you return how many records are returned, aggregate functions are always quicker so this will cut your load time dramatically but you are still repeatedly querying a huge dataset to it wont be exactly nippy.
you can tweak this further using exist to just return a 0 or 1.
in this case you would look at a table with all possible filters and then remove the ones with no values from the parallel query
the next option will be the fastest by a mile is to cache the filters in the DB, with a separate table
then you can query that and say from Cache, where filter = ABC select D, the problem with this maintaining the cache, which you would have to do in the DB as part of the save functions, trigggers etc.
Another solution that can be added in addition to the previous suggestions is to use the /*+ result_cache */ hint, if your version of Oracle supports it (Oracle version 11g or later). If the output of the query is small enough for a drop-down list, then when a user enters criteria that matches the same criteria another user used, the results are returned in a few milliseconds instead of a few seconds or minutes. Result cache is wonderful for queries that return a small set of rows out of millions.
select /*+ result_cache */ item_desc from some_table where item_id ...
The result cache is automatically flushed when any insert/updates/deletes occur on the database tables.
I've done something 'kind of' similar in the past - if you can add a table to the database then I'd explore introducing a 'scratchpad' type table where results are temporarily stored as the user refines their search. Since multiple users could be working simultaneously the table would have to have an additional column for identifying the user.
I'd think you'd see some performance benefit since all processing is kept server-side and your app would simply be pulling data from this table. Since you're adding this table you would also have total control over it.
Essentially I'd imagine the program flow would go something like:
User selects some filters and clicks 'Search'.
Server populates scratchpad table with results from that search.
App populates results grid from scratchpad table.
User further refines search and clicks 'Search'.
Server removes/adds rows to scratchpad table as necessary.
App populates results grid from scratchpad table.
And so on.
Rather than having all the users results in one 'scratchpad' table you could possibly explore having temporary 'scratchpad' tables per user.
I currently have a sharepoint2010 list that contains roughly 200,000 records.
I want to get each record, tweak it, massage it, and store it in a SQL table.
As of now, I am using the sharepoint 2010 web service method GetListItems like so...
System.Xml.XmlNode nodeListItems = client.GetListItems("guid", "", query, viewFields, RowNumber, queryOptions, null);
querying 200000 records is too much to query at once. How can I get around this? The GetListItems method takes CAML query parameters.
Is there a way to do this in increments, like say 5000 records at a time? How would one structure the CAML query to do that?
Unless someone has a better way of accomplishing this altogether?
Yes, there is, you can paginate the results. The fifth parameter is the page size, you have it set via RowNumber. Set it to 5000 if you want pages of size 5000.
Details on accessing subsequent pages can be seen from the documentation for the GetListItems method
The GetListItems method supports server-side paging. The XML data
returned by this method includes a ListItemCollectionPositionNext
attribute inside the rs:Data element that contains the information to
support paging. This string contains data for the fields in the sort
and for other items needed for paging. You should consider this string
internal and not to be modified; modifying it can produce unexpected
results. The following example shows the form of this return value
when paging is supported.
<rs:Data ListItemCollectionPositionNext="
Paged=TRUE&p_ID=100&View=
%7bC68F4A6A%2d9AFD%2d406C%2dB624%2d2CF8D729901E%7d&PageFirstRow=
101" Count=1000 >
<z:row ows_FirstName="Nancy" ows_LastName="Name" ….. />
...
</rs:Data>
To get the next page of data, the queryOption parameter is used, as
shown in the following example.
<QueryOptions>
<Paging ListItemCollectionPositionNext="
Paged=TRUE&p_ID=100&View=
%7bC68F4A6A%2d9AFD%2d406C%2dB624%2d2CF8D729901E%7d&PageFirstRow=
101" />
</QueryOptions>
So all you need to do is grab the data in that attribute from each page's result set and add it to the query to get the next page.
I have the following that returns a collection from Azure table storage where Skip is not implemented. The number of rows returned is approximately 500.
ICollection<City> a = cityService.Get("0001I");
What I would like to do is to be able to depending on an argument have just the following ranges returned:
records 1-100 passing in 0 as an argument to a LINQ expression
records 101-200 passing in 100 as an argument to a LINQ expression
records 201-300 passing in 200 as an argument to a LINQ expression
records 301-400 passing in 300 as an argument to a LINQ expression
etc
Is there some way I can add to the above and use link to get these ranges
of records returned:
As you already stated in your question, the Skip method is not implemented in Windows Azure Table storage. This means you have 2 options left:
Option 1
Download all data from table storage (by using ToList, see abatishchev's answer) and execute the Skip and Take methods on this complete list. In your question you're talking about 500 records. If the number of records doesn't grow too much this solution should be OK for you, just make sure that all records have the same partition key.
If the data grows you can still use this approach, but I suggest you evaluate a caching solution to store all the records instead of loading them from table storage over and over again (this will improve the performance, but don't expect this to work with very large amounts of data). Caching is possible in Windows Azure using:
Windows Azure Caching (Preview)
Windows Azure Shared Caching
Option 2
The CloudTableQuery class allows you to query for data, but more important to receive a continuation token to build a paging implementation. This allows you to detect if you can query for more data, the pagination example on Scott's blogpost (see nemensv's comment) uses this.
For more information on continuation tokens I suggest you take a look at Jim's blogpost: Azure#home Part 7: Asynchronous Table Storage Pagination. By using continuation tokens you only download the data for the current page meaning it will also work correctly even if you have millions of records. But you have to know the downside of using continuation tokens:
This won't work with the Skip method out of the box, so it might not be a solution for you.
No page 'numbers', because you only know if there's more data (not how much)
No way to count all records
If paging is not supported by the underlying engine, the only way to implement it is to load all the data into memory and then perform paging:
var list = cityService.Get("0001I").ToList(); // meterialize
var result = list.Skip(x).Take(y);
Try something like this:
cityService.Get("0001I").ToList().Skip(n).Take(100);
This should return records 201-300:
cityService.Get("0001I").ToList().Skip(200).Take(100);
a.AsEnumerable().Skip(m).Take(n)