Optimize neo4j cypher import query - c#

I created an application in C# using Neo4j.Driver.V1 that reads from a CSV and writes it into a Neo4j graph database.
My CSV has 1000 records. Each record is split into 5 nodes with relationships between them.
The whole process takes 1 min 11 seconds (including 1 second for my logic that builds the query).
This is way too much considering they will be uploading millions of records.
Here is my query:
MERGE
(accountd71d278a8eeb468f9e4517ac1e007fe5:Account
{
number: '952'
} )
ON CREATE
SET accountd71d278a8eeb468f9e4517ac1e007fe5 +=
{
number: '952',
balanceType: 2,
accountType: 2,
openDate: apoc.date.parse('7/9/2015', 'ms', 'm/d/YYYY')
}
MERGE (account13aa03cd1b6d449e88a3e5e5a22353da:Account
{
number: '198'
} )
ON CREATE
SET account13aa03cd1b6d449e88a3e5e5a22353da +=
{
number: '198'
}
MERGE (transactionba1459c4f7854157be237e7365497fcf:Transaction
{
number: '1'
} )
ON CREATE
SET transactionba1459c4f7854157be237e7365497fcf +=
{
number: '1',
amount: 3717.81,
type: 2,
date: apoc.date.parse('2016-05-27', 'ms', 'YYYY-mm-dd')
}
MERGE (bank3679799504f54bed9f079848be9c6eff:Bank
{
code: 'MMBC'
} )
ON CREATE
SET bank3679799504f54bed9f079848be9c6eff +=
{
code: 'MMBC',
country: 'Mongolia'
}
MERGE (bank522b6b6ed04d40bd9d87d4ecc36fbde2:Bank
{
code: 'VALL'
} )
ON CREATE
SET bank522b6b6ed04d40bd9d87d4ecc36fbde2 +=
{
code: 'VALL',
country: 'Mongolia'
}
MERGE (accountd71d278a8eeb468f9e4517ac1e007fe5)-[:credits]->(transactionba1459c4f7854157be237e7365497fcf)
MERGE (accountd71d278a8eeb468f9e4517ac1e007fe5)-[:residesWith]->(bank3679799504f54bed9f079848be9c6eff)
MERGE (transactionba1459c4f7854157be237e7365497fcf)-[:debits]->(account13aa03cd1b6d449e88a3e5e5a22353da)
MERGE (account13aa03cd1b6d449e88a3e5e5a22353da)-[:residesWith]->(bank522b6b6ed04d40bd9d87d4ecc36fbde2)
Any ideas how I can reduce the time of my query?
Before offering any ideas, here is what I tried already:
Removing the long GUID-based names
Removing the use of apoc.date.parse
Considering the built-in CSV import functionality, but the DB is on another server
Combining multiple records into one query (and found that 2 at once performs best)
Creating constraints
Thanks in advance!
K

Here is a list of points to optimize your process:
Use query parameters: all of your query data should be passed as parameters. If you do this, Neo4j does not have to recompute the query plan every time.
Batch your queries: I think you currently do one transaction for each row of your CSV. Try to batch your queries (one transaction per 1000 rows should be OK, but if your CSV grows you will need more transactions).
Create one query per node/relationship creation instead of doing one big query, and for the relationships use the MATCH MATCH MERGE pattern (you have the constraints, so it will be fast). A sketch combining the first two points is shown after this list.
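For illustration, here is a minimal C# sketch of the parameter + batching approach, assuming Neo4j 3.1+ (for the $rows parameter syntax), the Neo4j.Driver.V1 package, and a uniqueness constraint on :Account(number); the row field names are placeholders, not the asker's actual CSV schema:
using System.Collections.Generic;
using Neo4j.Driver.V1;

class BatchImporter
{
    // One parameterized statement per node label; the whole batch is sent as the $rows parameter.
    const string MergeAccounts = @"
        UNWIND $rows AS row
        MERGE (a:Account { number: row.accountNumber })
          ON CREATE SET a.balanceType = row.balanceType,
                        a.accountType = row.accountType,
                        a.openDate    = row.openDate";

    // Relationship creation with the MATCH MATCH MERGE pattern.
    const string MergeCredits = @"
        UNWIND $rows AS row
        MATCH (a:Account { number: row.accountNumber })
        MATCH (t:Transaction { number: row.transactionNumber })
        MERGE (a)-[:credits]->(t)";

    public static void ImportBatch(IDriver driver, List<Dictionary<string, object>> rows)
    {
        // One transaction per batch (e.g. 1000 CSV rows) instead of one per record.
        using (var session = driver.Session())
        using (var tx = session.BeginTransaction())
        {
            tx.Run(MergeAccounts, new { rows });
            tx.Run(MergeCredits, new { rows });
            tx.Success();   // mark the transaction to be committed on dispose
        }
    }
}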

Related

How to optimize C# MongoDB query on large databases?

I have a database table having 100 million records (the screenshot is taken from Robomongo).
Table schema: there are 100 million records.
When I run the following code I get results, but it takes around 1 minute to complete. I need to optimize the query so it returns results faster. What I have done so far is shown below. Please tell me the way forward to achieve an optimized result.
var collection = _database.GetCollection<BsonDocument>("FloatTable1");
var sw = Stopwatch.StartNew();
var builder = Builders<BsonDocument>.Filter;
int min = Convert.ToInt32(textBox13.Text); //3
int max = Convert.ToInt32(textBox14.Text); //150
var filt = builder.Gt("Value", min) & builder.Lt("Value", max);
var list = collection.Find(filt);
sw.Stop();
TimeSpan time = sw.Elapsed;
Console.WriteLine("Time to Fetch Record: " + time.ToString());
var sw1 = Stopwatch.StartNew();
var list1 = list.ToList();
sw1.Stop();
TimeSpan time1 = sw1.Elapsed;
Console.WriteLine("Time to Convert var to List: " + time1.ToString());
Console.WriteLine("Total Count in List: " + list1.Count.ToString());
Output is:
Time to Fetch Record: 00:00:00.0059207
Time to Convert var to List: 00:01:00.7209163
Total Count in List: 1003154
I have a few questions related to the given code.
When the line collection.Find(filt) executes, does it fetch the filtered records from the database, or does it just create the filter?
var list1 = list.ToList(); takes 1 minute to execute. Is it only converting from var to List, or is it first fetching the data and then converting?
How can I achieve this query and result in the least possible time? Please help.
When the line collection.Find(filt) executes, does it fetch the filtered
records from the database, or does it just create the filter?
It is just creating the filter.
var list1 = list.ToList(); takes 1 minute to execute. Is it only
converting from var to List, or is it first fetching the data and then converting?
It is fetching the data and converting.
How can I achieve this query and result in the least possible time? Please help.
The fetching/filtering on the database is eating your time. The easiest way to speed it up would be to create an index on the field you are filtering on.
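A minimal sketch of creating that index, assuming the MongoDB .NET driver 2.x and the "FloatTable1"/"Value" names from the question:
using MongoDB.Bson;
using MongoDB.Driver;

static class IndexSetup
{
    public static void EnsureValueIndex(IMongoDatabase database)
    {
        var collection = database.GetCollection<BsonDocument>("FloatTable1");
        var keys = Builders<BsonDocument>.IndexKeys.Ascending("Value");
        // Creating an index that already exists is a no-op, so this is safe to run at startup.
        collection.Indexes.CreateOne(new CreateIndexModel<BsonDocument>(keys));
    }
}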
Everything else would need more effort or other database techniques, like creating a field which represents your date more coarsely (e.g. grouped by day) and indexing that one, or creating something like table sections grouped by a given timespan (I'm not a DB admin and don't know the proper terms for this, but I remember somebody doing it on a database with billions of records).

get and post table type using pqxx

Good day, friends. I'm using pqxx and I have some questions.
1. I have two tables:
table1 (table1_id integer) and table2 (table1_id integer, another_id integer) with a one-to-many relation.
How can I easily get the information in a view like: table1_id, vector of another_ids?
Right now I use serialization in a script (string concat into "%d %d %d...") and deserialization in my C++ code.
I also need to insert values into table1. How can I do this in one transaction?
2. I call a stored procedure like
t.exec("SELECT * FROM my_proc(some_argument)");
Is there any way to do this like in C#?
Thank you very much!
So, maybe this can help someone.
For the first question I found and used two ways:
1. Group concat in an SQL function and deserialization in C++. It is fast if table2 has only table1_id and one other integer.
2. I call two functions, get_table1() and get_table2(), both ordered by table1_id. Then with two pointers I build the array of table1:
std::vector<Table1> table1Array;
auto ap  = wrk.prepared(GetTable1FuncName).exec();   // table1 rows, ordered by table1_id
auto aps = wrk.prepared(GetTable2FuncName).exec();   // table2 rows, ordered by table1_id
auto pos = aps.begin();
for (auto row = ap.begin(); row != ap.end(); ++row) {
    std::vector<Table2> table2Array;
    // Consume every table2 row belonging to the current table1_id (a merge join over the two sorted results).
    while (pos != aps.end()
           && row["table1_id"].as(int()) == pos["table1_id"].as(int())) {
        table2Array.push_back(Table2(pos["first_id"].as(int()),
                                     pos["second_string"].as(std::string())));
        ++pos;
    }
    Table1 tb1(row["table1_id"].as(int()), row["column2"].as(int()),
               row["column3"].as(int()), row["column4"].as(int()),
               table2Array);
    table1Array.push_back(tb1);
}
Maybe it is not pretty, but it works.
Inserting into the database I do one element at a time: first insert into Table1, a few lines later insert into Table2, and afterwards call pqxx::work.commit(), so both inserts go in the same transaction.
For the second question: no, such a way doesn't exist. Also remember, the function always returns 1 row! Be careful!

Insert into 120 columns from 120-indexed array

I have column names like this
Id
,Test
,[H01_1]
,[H01_2]
,[H01_3]
,[H01_4]
,[H01]
,[H02_1]
,[H02_2]
,[H02_3]
,[H02_4]
,[H02]
,[H03_1]
,[H03_2]
,[H03_3]
,[H03_4]
,[H03]
,[H04_1]
,[H04_2]
,[H04_3]
,[H04_4]
,[H04]
,[H05_1]
,[H05_2]
,[H05_3]
,[H05_4]
,[H05]
,[H06_1]
,[H06_2]
,[H06_3]
,[H06_4]
,[H06]
,[H07_1]
,[H07_2]
,[H07_3]
,[H07_4]
,[H07]
,[H08_1]
,[H08_2]
,[H08_3]
,[H08_4]
,[H08]
,[H09_1]
,[H09_2]
,[H09_3]
,[H09_4]
,[H09]
,[H10_1]
,[H10_2]
,[H10_3]
,[H10_4]
,[H10]
,[H11_1]
,[H11_2]
,[H11_3]
,[H11_4]
,[H11]
,[H12_1]
,[H12_2]
,[H12_3]
,[H12_4]
,[H12]
,[H13_1]
,[H13_2]
,[H13_3]
,[H13_4]
,[H13]
,[H14_1]
,[H14_2]
,[H14_3]
,[H14_4]
,[H14]
,[H15_1]
,[H15_2]
,[H15_3]
,[H15_4]
,[H15]
,[H16_1]
,[H16_2]
,[H16_3]
,[H16_4]
,[H16]
,[H17_1]
,[H17_2]
,[H17_3]
,[H17_4]
,[H17]
,[H18_1]
,[H18_2]
,[H18_3]
,[H18_4]
,[H18]
,[H19_1]
,[H19_2]
,[H19_3]
,[H19_4]
,[H19]
,[H20_1]
,[H20_2]
,[H20_3]
,[H20_4]
,[H20]
,[H21_1]
,[H21_2]
,[H21_3]
,[H21_4]
,[H21]
,[H22_1]
,[H22_2]
,[H22_3]
,[H22_4]
,[H22]
,[H23_1]
,[H23_2]
,[H23_3]
,[H23_4]
,[H23]
,[H24_1]
,[H24_2]
,[H24_3]
,[H24_4]
,[H24]
And I am trying to write a simple INSERT with Dapper (SQL Server 2014).
For the Id and Test I am writing an anonymous object to put into the param, but I wasn't sure what's the best way to take a 120-length int? array and insert it into the columns beginning with H,
where index 0 goes to H01_1 and index 1 goes to H01_2 ... etc.
I don't want to have to write SQL that says
H01_1 = @H01_1,
H01_2 = @H01_2,
...
and then also have to make an anonymous object that does
H01_1 = array[0],
H01_2 = array[1],
...
One thing I could do is insert just Id and Test and then go back and do an UPDATE on that record, but then I am still in the same scenario as before, where I don't know the best way to write an UPDATE in Dapper without writing things out 120 times.
If it is possible to change your table structure, then follow the design below.
Your table has a large number of columns, with ID assigned as the primary key. So instead of the structure above, use this one:
ID  test  column  value
01  xyz   H01_1   val_H01_1
01  xyz   H01_2   val_H01_2
Assign a composite primary key on ID, test and column.
If it is not possible to change the structure, then build an XML document from your front-end data and create a stored procedure to process it. If you go through http://www.itworld.com/article/2960645/development/tsql-how-to-use-xml-parameters-in-stored-procedures.html you will get the idea.
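As a rough illustration of the first suggestion (not part of the original answer), here is a minimal Dapper sketch that turns the 120-element array into one row per value; the HourValues table name, the connection handling, and the assumption that the array follows the column order from the question are all placeholders:
using System.Collections.Generic;
using System.Data.SqlClient;
using Dapper;

static class HourValueInsert
{
    public static void Insert(string connectionString, int id, string test, int?[] values)
    {
        // Map each array index to a column name, assuming the order from the question:
        // index 0 -> H01_1, 1 -> H01_2, 2 -> H01_3, 3 -> H01_4, 4 -> H01, 5 -> H02_1, ...
        var rows = new List<object>();
        for (var i = 0; i < values.Length; i++)
        {
            var grp = i / 5 + 1;    // H01 .. H24
            var slot = i % 5 + 1;   // _1 .. _4, then the plain Hxx column
            var column = slot == 5 ? $"H{grp:00}" : $"H{grp:00}_{slot}";
            rows.Add(new { Id = id, Test = test, Column = column, Value = values[i] });
        }

        using (var connection = new SqlConnection(connectionString))
        {
            // Dapper executes the statement once per element when given a sequence of parameter objects.
            connection.Execute(
                "INSERT INTO HourValues (Id, Test, [Column], [Value]) VALUES (@Id, @Test, @Column, @Value)",
                rows);
        }
    }
}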

DataTable sum duplicate rows

I have a DataTable which can have semi-duplicate rows. In the picture example, two highlighted rows have all the same values except the amounts in the 'Amount' columns. I need to identify those rows and sum the amounts. The data comes from a text file and there is no key to uniquely identify rows.
I looked at some answers, like this one:
Best way to remove duplicate entries from a data table, but in my case I would need to match on not just one column but 10.
I also tried different LINQ queries but was not successful in getting far.
This is the SQL query which does the job:
SELECT [Date1],[Date2],[Date3]
,SUM([Amount1]) as Summary1
,SUM([Amount2]) as Summary2
,SUM([Amount3]) as Summary3
,[col1],[col2],[Rate1],[Rate2],[Rate3],[product],[comment]
FROM [Table]
group by [Date1],[Date2],[Date3],[col1],[col2],[Rate1],[Rate2],[Rate3],[product],[comment]
EDIT: Just to clarify, the SQL query was an example of how I could get successful results if I were querying a SQL table.
This is how you would do it with EF:
var result = db.Table
    .GroupBy(r => new {
        r.Date1,
        r.Date2,
        r.Date3,
        r.col1,
        r.col2,
        r.Rate1,
        r.Rate2,
        r.Rate3,
        r.product,
        r.comment },
    p => new {
        p.Amount1,
        p.Amount2,
        p.Amount3 },
    (key, vals) => new {
        key.Date1,
        key.Date2,
        key.Date3,
        Amount1 = vals.Sum(v => v.Amount1),
        Amount2 = vals.Sum(v => v.Amount2),
        Amount3 = vals.Sum(v => v.Amount3),
        key.col1,
        key.col2,
        key.Rate1,
        key.Rate2,
        key.Rate3,
        key.product,
        key.comment }
    );
A slight modification if you are actually doing it from a DataTable:
var result = dt.AsEnumerable()
    .GroupBy(d => new {
        Date1 = d.Field<DateTime>("Date1"),
        Date2 = d.Field<DateTime>("Date2"),
        Date3 = d.Field<DateTime>("Date3"),
        col1 = d.Field<string>("col1"),
        col2 = d.Field<string>("col2"),
        Rate1 = d.Field<decimal>("Rate1"),
        Rate2 = d.Field<decimal>("Rate2"),
        Rate3 = d.Field<decimal>("Rate3"),
        product = d.Field<string>("product"),
        comments = d.Field<string>("comments") },
    p => new {
        Amount1 = p.Field<decimal>("Amount1"),
        Amount2 = p.Field<decimal>("Amount2"),
        Amount3 = p.Field<decimal>("Amount3") },
    (key, vals) => new {
        key.Date1,
        key.Date2,
        key.Date3,
        Amount1 = vals.Sum(v => v.Amount1),
        Amount2 = vals.Sum(v => v.Amount2),
        Amount3 = vals.Sum(v => v.Amount3),
        key.col1,
        key.col2,
        key.Rate1,
        key.Rate2,
        key.Rate3,
        key.product,
        key.comments }
    );
If you need the result as a DataTable, you can use one of the many List-to-DataTable extension methods out there. Just add ".ToList().AsDataTable()" at the end of the above query to get it back into a DataTable.
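For reference, one such extension method might look like the following reflection-based sketch (an assumption of what AsDataTable could do, not code from the original answer):
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;

public static class DataTableExtensions
{
    // Builds a DataTable with one column per public property of T and one row per list item.
    public static DataTable AsDataTable<T>(this IList<T> items)
    {
        var table = new DataTable(typeof(T).Name);
        var properties = TypeDescriptor.GetProperties(typeof(T));

        foreach (PropertyDescriptor prop in properties)
            table.Columns.Add(prop.Name, Nullable.GetUnderlyingType(prop.PropertyType) ?? prop.PropertyType);

        foreach (var item in items)
        {
            var row = table.NewRow();
            foreach (PropertyDescriptor prop in properties)
                row[prop.Name] = prop.GetValue(item) ?? DBNull.Value;
            table.Rows.Add(row);
        }
        return table;
    }
}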

Combine 3 different datatables into 1 and performance with SQL

I was asked to do a report that combines 3 different Crystal Reports that we use. Those reports are already very slow and heavy, and making 1 big one was out of the question, so I created a little app in VS 2010.
My main problem is this: I have 3 DataTables (same schema), created with the DataSet designer, that I need to combine. I created an empty table to store the combined values. The queries are already pretty big, so combining them in a single SQL query is really out of the question.
Also, I do not have write access to the SQL Server (2005), because the server is maintained by the company that created our MRP program, although I could always ask support to add a view to the server.
My 3 DataTables consist of labor cost, material cost and subcontracting cost. I need to create a total-cost table that adds up the Cost column of each table by ID. All the tables have keys to find and select them.
The problem is that fetching all of the current jobs is fine (500 ms for 400 records), because I have a query that fetches only the active jobs. The problem is with inventory: since I do not know when those jobs were finished, I have to fetch the entire database (around 10000 jobs, with subqueries that each return up to 100 records), and this for all 3 tables. That takes around 5000 to 8000 ms; although it is very fast compared to the Crystal Report, there is one problem.
I need to create a summary table that combines all these different tables, but I also need to build it 2 times, 1 time for each date that is output. So my data always changes, because it is based on a date parameter. Right now it takes around 12-20 sec to fetch everything.
I need a way to reduce the load time. Here is what I tried:
Tried a for loop to combine the 3 tables.
Then tried the DataTableReader class to read each line, using the FindByKey methods that the DataSet designer created to find the value in the other table, and I have to do this 2 times (it seems to go a little faster than the for loop).
Tried LINQ; I don't think it is possible, and would it give more performance?
Tried a dynamic query that uses "WHERE IN (comma-separated list)" (that actually doubled the execution time compared to fetching the whole database).
Tried to join my inventory query to my cost queries (that also increased the time it took).
1 - So is there any way to combine my tables more effectively? What is the fastest way to Merge and Sum my records of my 3 tables?
2 - Is there any way to increase performance of my queries without having write access to the server?
Below is some of the code I used for reference :
public static void Fill()
{
    DateTime Date = Data.Date;
    AllieesDBTableAdapters.CoutMatTableAdapter mat = new AllieesDBTableAdapters.CoutMatTableAdapter();
    AllieesDBTableAdapters.CoutLaborTableAdapter lab = new AllieesDBTableAdapters.CoutLaborTableAdapter();
    AllieesDBTableAdapters.CoutSTTableAdapter st = new AllieesDBTableAdapters.CoutSTTableAdapter();
    Data.allieesDB.CoutTOT.Clear();
    // Around 2 sec each Fill
    mat.FillUni(Data.allieesDB.CoutMat, Date);
    Data.allieesDB.CoutMat.CopyToDataTable(Data.allieesDB.CoutTOT, LoadOption.OverwriteChanges);
    lab.FillUni(Data.allieesDB.CoutLabor, Date);
    MergeTable(Data.allieesDB.CoutLabor);
    st.FillUni(Data.allieesDB.CoutST, Date);
    MergeTable(Data.allieesDB.CoutST);
}
Here is the MergeTable method (the for loop I tried is in comments):
private static void MergeTable(DataTable Table)
{
    AllieesDB.CoutTOTDataTable dtTOT = Data.allieesDB.CoutTOT;
    DataTableReader r = new DataTableReader(Table);
    while (r.Read())
    {
        DataRow drToT = dtTOT.FindByWO(r.GetValue(2).ToString());
        if (drToT != null)
        {
            drToT["Cout"] = (decimal)drToT["Cout"] + (decimal)r.GetValue(3);
        }
        else
        {
            EA_CoutsDesVentes.AllieesDB.CoutTOTRow row = dtTOT.NewCoutTOTRow();
            for (int j = 0; j < r.FieldCount; j++)
            {
                if (r.GetValue(j) != null)
                {
                    row[j] = r.GetValue(j);
                }
                else
                {
                    row[j] = null;
                }
            }
            dtTOT.AddCoutTOTRow(row);
        }
        Application.DoEvents();
    }

    //try
    //{
    //    for (int i = 0; i < Table.Rows.Count; i++)
    //    {
    //        DataRow drSource = Table.Rows[i];
    //        DataRow drToT = dtTOT.FindByWO(drSource["WO"].ToString());
    //        if (drToT != null)
    //        {
    //            drToT["Cout"] = (decimal)drToT["Cout"] + (decimal)drSource["Cout"];
    //        }
    //        else
    //        {
    //            EA_CoutsDesVentes.AllieesDB.CoutTOTRow row = dtTOT.NewCoutTOTRow();
    //            for (int j = 0; j < drSource.Table.Columns.Count; j++)
    //            {
    //                if (drSource[j] != null)
    //                {
    //                    row[j] = drSource[j];
    //                }
    //                else
    //                {
    //                    row[j] = null;
    //                }
    //            }
    //            dtTOT.AddCoutTOTRow(row);
    //        }
    //        Application.DoEvents();
    //    }
    //}
    //catch (Exception)
    //{
    //}
}
On SQL Server 2005 and up, you can create a materialized (indexed) view of the aggregate values and dramatically speed up performance.
Look at Improving Performance with SQL Server 2005 Indexed Views.
