I have a database table with 100 million records. (The table schema was shown in a screenshot taken from Robomongo.)
When I run the following code I get results, but it takes around 1 minute to complete. I need to optimize the query to get results faster. Here is what I have done so far; please tell me how to move forward.
var collection = _database.GetCollection<BsonDocument>("FloatTable1");
var sw = Stopwatch.StartNew();
var builder = Builders<BsonDocument>.Filter;
int min = Convert.ToInt32(textBox13.Text); //3
int max = Convert.ToInt32(textBox14.Text); //150
var filt = builder.Gt("Value", min) & builder.Lt("Value", max);
var list = collection.Find(filt);
sw.Stop();
TimeSpan time = sw.Elapsed;
Console.WriteLine("Time to Fetch Record: " + time.ToString());
var sw1 = Stopwatch.StartNew();
var list1 = list.ToList();
sw1.Stop();
TimeSpan time1 = sw1.Elapsed;
Console.WriteLine("Time to Convert var to List: " + time1.ToString());
Console.WriteLine("Total Count in List: " + list1.Count.ToString());
Output:
Time to Fetch Record: 00:00:00.0059207
Time to Convert var to List: 00:01:00.7209163
Total Count in List: 1003154
I have a few questions about this code:
When the line collection.Find(filt) executes, does it fetch the filtered records from the database, or does it just create the filter?
var list1 = list.ToList(); takes 1 minute to execute. Is it only converting from var to a list, or is it first fetching the data and then converting?
How can I run this query and get the result in the least possible time? Please help.
When line collection.Find(filt) executes, does it fetch filtered record from the database OR Just creating filter?
It only builds the query. Nothing is sent to the database until the result is enumerated.
var list1 = list.ToList(); takes 1 minute to execute, is it only converting from var to list OR First fetching data than converting?
It executes the query, fetches the data, and then converts it into a list.
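To make the two answers above concrete, here is a minimal sketch of the same deferred-execution pattern using plain LINQ over an in-memory iterator. It is only an analogy, not the MongoDB driver: FetchRows and RowsFetched are hypothetical stand-ins for a database cursor and a row counter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionSketch
{
    // Counts how many rows the fake "cursor" has handed out so far.
    public static int RowsFetched;

    // Hypothetical stand-in for a database cursor: yields rows one at a time.
    public static IEnumerable<int> FetchRows(int count)
    {
        for (int i = 0; i < count; i++)
        {
            RowsFetched++;
            yield return i;
        }
    }

    // Returns { rows fetched before ToList, rows fetched after ToList, result size }.
    public static int[] Demo()
    {
        RowsFetched = 0;

        // Building the query is cheap: no row has been produced yet.
        var query = FetchRows(1000).Where(v => v > 3 && v < 150);
        int fetchedBeforeToList = RowsFetched;

        // Enumerating the query is where the actual work happens.
        var list = query.ToList();

        return new[] { fetchedBeforeToList, RowsFetched, list.Count };
    }
}
```

Calling DeferredExecutionSketch.Demo() shows that no rows are touched until ToList() runs, which matches the timings in the question: the stopwatch around Find measures only query construction, while the one around ToList() measures the real fetch.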
How to achieve this query and result in least possible time. Please Help.
The fetching and filtering on the database is what eats your time. The easiest way to speed it up is to create an index on the field you are filtering on.
Everything else would need more effort or additional database techniques, such as adding a field that represents your date more coarsely (e.g. truncated to the day) and indexing that one, or splitting the table into sections grouped by a given timespan (I'm not a DB admin and don't know the proper terms for this, but I remember somebody doing it on a database with billions of records ;) )
Is there any way in table storage to read and then update a record? For example in SQL server I would use a query like this:
UPDATE table
SET testValue = 1
OUTPUT inserted.columnA, inserted.columnB, inserted.columnC
WHERE testValue = 0
Currently my code looks like this:
var filter = "testValue eq 0";
var rangeQuery = new TableQuery<AzStorageEntityAdapter<T>>().Where(filter);
var result = _cloudTable.ExecuteQuery(rangeQuery);
var azStorageEntities = result.ToList();
IList<T> results = azStorageEntities.Select(r => r.InnerObject).ToList();
Is there a way to add an update clause alongside my where clause, so that when it reads the entities matching the filter, 'testValue' is also updated to 1?
Unfortunately it is not possible in a single operation.
You must first fetch an entity (1st operation), update it and then save it back in the table (2nd operation).
I have code in my C# console app that queries a LARGE dataset in SQL and adds it to an IEnumerable collection that I iterate through later in the app. On a SQL table that returns fewer than 100K rows it works great, but I have to use it to iterate through 100 million records. After the SQL query runs and Dapper tries to fill the collection, I end up with an OUT OF MEMORY exception. I'm pretty certain it's because it's trying to create 100 million objects at once. Is there a way I can batch the collection into no more than, say, 500K objects, do what I need to do, then come back and process another 500K, and so on? Essentially I need to READ 500K records from SQL and write them to a file, then read another 500K and write them to another file.
public List<AxDlsd> GetDistinctDlsdObjects(AxApp axApp, OperationType operationType)
{
    if (operationType == OperationType.Assessment)
    {
        string query = $"SELECT DISTINCT(clipid) from {axApp.dlname}";
        using (var connection = _dbConnectionFactory.GetAxDbConnection())
        {
            //SqlMapper.Settings.CommandTimeout = 0;
            List<AxDlsd> dlsdrecord = new List<AxDlsd>();
            // Note: ToList() materializes every row at once, which defeats buffered: false.
            return connection.Query<AxDlsd>(query, commandTimeout: 0, buffered: false).ToList();
        }
    }
    // (the rest of the method is not shown)
}
You can run SELECT COUNT(DISTINCT clipid) FROM {axApp.dlname} to get the total and then use that to page:
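int pageSize = 500000;
// total is the result of the COUNT(DISTINCT clipid) query
int pageCount = (total + pageSize - 1) / pageSize; // round up so the last partial page is included
for (var page = 0; page < pageCount; page++)
{
    string query = $"SELECT DISTINCT(clipid) from {axApp.dlname} ORDER BY clipid OFFSET {page * pageSize} ROWS FETCH NEXT {pageSize} ROWS ONLY";
    //...
}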
This will allow you to go through 500K rows at a time, or whatever your page size is. OFFSET/FETCH requires SQL Server 2012 or later; I'm not sure which database you are using.
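A different approach from OFFSET paging, sketched here under assumptions: since the question already passes buffered: false, the Dapper result can be enumerated lazily (without the trailing ToList()) and cut into fixed-size batches as it streams. The Batch and WriteBatches helpers below are hypothetical names, and the sketch works on any IEnumerable, so a plain in-memory sequence can stand in for the query result.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class BatchingSketch
{
    // Lazily splits a sequence into lists of at most batchSize items.
    // Only one batch is held in memory at a time.
    public static IEnumerable<List<T>> Batch<T>(IEnumerable<T> source, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var batch = new List<T>();
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>();
            }
        }

        // Emit the final, possibly smaller, batch.
        if (batch.Count > 0) yield return batch;
    }

    // Writes each batch to its own file (batch_0.txt, batch_1.txt, ...)
    // inside the given directory and returns the file paths.
    public static List<string> WriteBatches(IEnumerable<string> rows, int batchSize, string directory)
    {
        var paths = new List<string>();
        int fileIndex = 0;
        foreach (var batch in Batch(rows, batchSize))
        {
            var path = Path.Combine(directory, "batch_" + fileIndex + ".txt");
            File.WriteAllLines(path, batch);
            paths.Add(path);
            fileIndex++;
        }
        return paths;
    }
}
```

With the code from the question, the rows argument would be something like the unbuffered Query result projected to strings, and batchSize would be 500000. The connection has to stay open while the batches are being consumed, because the rows are read from it on demand.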
I created an application in C# using Neo4j.Driver.V1 that reads from a CSV and writes it into a neo4j graph database.
My csv has 1000 records. Each record is split into 5 nodes with relationships between them.
The whole process takes 1 minute 11 seconds (including 1 second for my own logic that builds the query).
This is way too much considering they will be uploading millions of records.
Here is my query:
MERGE
(accountd71d278a8eeb468f9e4517ac1e007fe5:Account
{
number: '952'
} )
ON CREATE
SET accountd71d278a8eeb468f9e4517ac1e007fe5 +=
{
number: '952',
balanceType: 2,
accountType: 2,
openDate: apoc.date.parse('7/9/2015', 'ms', 'M/d/yyyy')
}
MERGE (account13aa03cd1b6d449e88a3e5e5a22353da:Account
{
number: '198'
} )
ON CREATE
SET account13aa03cd1b6d449e88a3e5e5a22353da +=
{
number: '198'
}
MERGE (transactionba1459c4f7854157be237e7365497fcf:Transaction
{
number: '1'
} )
ON CREATE
SET transactionba1459c4f7854157be237e7365497fcf +=
{
number: '1',
amount: 3717.81,
type: 2,
date: apoc.date.parse('2016-05-27', 'ms', 'yyyy-MM-dd')
}
MERGE (bank3679799504f54bed9f079848be9c6eff:Bank
{
code: 'MMBC'
} )
ON CREATE
SET bank3679799504f54bed9f079848be9c6eff +=
{
code: 'MMBC',
country: 'Mongolia'
}
MERGE (bank522b6b6ed04d40bd9d87d4ecc36fbde2:Bank
{
code: 'VALL'
} )
ON CREATE
SET bank522b6b6ed04d40bd9d87d4ecc36fbde2 +=
{
code: 'VALL',
country: 'Mongolia'
}
MERGE (accountd71d278a8eeb468f9e4517ac1e007fe5)-[:credits]->(transactionba1459c4f7854157be237e7365497fcf)
MERGE (accountd71d278a8eeb468f9e4517ac1e007fe5)-[:residesWith]->(bank3679799504f54bed9f079848be9c6eff)
MERGE (transactionba1459c4f7854157be237e7365497fcf)-[:debits]->(account13aa03cd1b6d449e88a3e5e5a22353da)
MERGE (account13aa03cd1b6d449e88a3e5e5a22353da)-[:residesWith]->(bank522b6b6ed04d40bd9d87d4ecc36fbde2)
Any ideas how I can reduce the time of my query?
Before offering any ideas, here is what I tried already:
Removing the long names with GUIDs
Removing the use of apoc date parse
Considering the built-in import-from-CSV functionality, but the db is on another server
Combining multiple records into one query (two records per query performed best)
Creating constraints
Thanks in advance!
K
Here is a list of points to optimize your process:
Use query parameters: all the data in your query should be passed as parameters. If you do, Neo4j can reuse the cached query plan instead of planning the query again every time.
Batch your queries: I think you run one transaction for each row of your CSV. Try to batch your queries (one transaction for 1000 rows should be OK, but if your CSV grows, you will need more transactions).
Create one query per node/relationship creation instead of one big query, and for the relationships use the MATCH MATCH MERGE pattern (you have the constraints, so it will be fast).
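As a rough illustration of the first point, the sketch below keeps the Cypher text constant and moves the per-row values into a parameter dictionary. The class and method names are hypothetical, the property names are taken from the query in the question, and the $name parameter syntax is an assumption that depends on your Neo4j version.

```csharp
using System;
using System.Collections.Generic;

public static class AccountQuerySketch
{
    // The query text never changes, so the server can reuse its plan.
    // No generated node names or literal values are embedded in it.
    public const string MergeAccount =
        "MERGE (a:Account {number: $number}) " +
        "ON CREATE SET a.balanceType = $balanceType, a.accountType = $accountType";

    // Builds the parameter values for one CSV row.
    public static Dictionary<string, object> ParametersFor(
        string number, int balanceType, int accountType)
    {
        return new Dictionary<string, object>
        {
            { "number", number },
            { "balanceType", balanceType },
            { "accountType", accountType }
        };
    }
}
```

With Neo4j.Driver.V1, each batch of rows could then be sent inside a single write transaction, calling tx.Run(AccountQuerySketch.MergeAccount, parameters) once per row; treat the exact call shape as something to check against your driver version. The relationship queries would follow the same idea, with two MATCH clauses on the constrained properties followed by a MERGE of the relationship.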
My query looks like this:
var query = dbContext.table1.Join(dbContext.table2, i => i.id, j => j.id,
    (i, j) => new {
        name = i.name,
        hours = (new decimal?[] { j.day1, j.day2, j.day3 }.Sum()),
        total = ???????
    }).ToArray();
In the hours field I am getting the values of individual user's working hours for three days. In the "total" field I want to display the sum of all users' "hours" values.
Can you tell me how to get the "total" value?
var total = query.Sum(x => x.hours);
Since this total is for all rows in the result set, you do not want one value for each row, but one value representing the aggregate of the entire array.
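A small self-contained sketch of that idea, using in-memory data in place of the database join (the TotalHoursSketch class and its input shape are made up for illustration): the per-user hours are computed per row, and the grand total is computed once over the whole result.

```csharp
using System;
using System.Linq;

public static class TotalHoursSketch
{
    // Each inner array holds one user's hours for day1, day2, day3.
    // Null entries are skipped by Sum(), as with the nullable columns in the question.
    public static decimal? Total(decimal?[][] daysPerUser)
    {
        // Per-row projection, analogous to the anonymous type in the question.
        var rows = daysPerUser
            .Select(days => new { hours = days.Sum() })
            .ToArray();

        // One aggregate over the entire result set.
        return rows.Sum(x => x.hours);
    }
}
```

For example, two users with hours { 8, 7.5, null } and { 6, 6, 6 } produce per-row values of 15.5 and 18 and a total of 33.5.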
I have to update every row in a SQL Server table with about 150,000 records using Entity Framework. To reduce the number of hits the server takes, I would like to do this in separate batches of 1000 rows. I need Entity Framework to:
Select the first 1000 rows from the DB.
Update those rows.
Call SaveChanges() method.
Get next 1000 rows.
Repeat.
What's the best way to achieve this?
I'm using Entity Framework 4 and SQL Server 2012.
Use LINQ Skip and Take:
return query.Skip(rowsToSkip)   // how many rows to skip; zero for the first page
            .Take(pageSize)     // how many rows to take, i.e. your page size
            .ToList();
If you want to do it within a loop, you can do something like this:
int pagingIncrement = 1000;
for (int i = 0; i < 150000; i += pagingIncrement)
{
    var query = /* your actual LINQ query, including its OrderBy */;
    var results = query.Skip(i).Take(pagingIncrement).ToList();
    UpdatePartialResults(results);
}
Note: It is important that, while updating those rows, you don't change the values used by the ORDER BY in your actual LINQ query; otherwise you could end up updating the same rows again and again (because of the reordering).
Another idea is to wrap the Skip(counter).Take(pagingSize) approach in an iterator method that yields each page, so that pages are processed lazily as they are fetched.
Something like this should work:
int skip = 0;
int take = 1000;
for (int i = 0; i < 150; i++)
{
    var rows = (from x in Context.Table
                select x).OrderBy(x => x.id).Skip(skip).Take(take).ToList();
    // do some update stuff with rows, then call Context.SaveChanges()
    skip += take;
}
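A variation on the Skip/Take loops above, offered as a sketch rather than as what the answers propose: keyset paging remembers the last id processed and asks for the next rows after it, so the batches do not depend on a row count known in advance. The Row class, the Processed flag, and the saveChanges delegate (standing in for Context.SaveChanges()) are hypothetical; an in-memory list via AsQueryable() is enough to try it out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Row
{
    public int Id { get; set; }
    public bool Processed { get; set; }
}

public static class KeysetBatchSketch
{
    // Updates every row in batches of batchSize, ordered by Id.
    // Returns the number of batches that were processed.
    // Assumes Id is unique and is not modified by the update itself.
    public static int UpdateInBatches(IQueryable<Row> table, int batchSize, Action saveChanges)
    {
        int lastId = int.MinValue; // rows with Id == int.MinValue would be skipped
        int batches = 0;

        while (true)
        {
            // Fetch the next batch strictly after the last processed Id.
            var rows = table.Where(r => r.Id > lastId)
                            .OrderBy(r => r.Id)
                            .Take(batchSize)
                            .ToList();
            if (rows.Count == 0) break;

            foreach (var row in rows)
            {
                row.Processed = true; // placeholder for the real update
            }

            saveChanges(); // one save per batch
            lastId = rows[rows.Count - 1].Id;
            batches++;
        }

        return batches;
    }
}
```

Because each batch is selected by Id rather than by position, changing other columns during the update cannot cause rows to be visited twice, as long as the Id itself stays untouched.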