I'm experiencing OutOfMemory exceptions in my application, when fetching from the database. It is a C# .Net application using Linq2Sql.
I have tried using GC.GetTotalMemory() to see how much memory is taken up before and after the call to the database. This gives a nice although not quite accurate picture of what is going on. When I look in the Windows Task Manager I can see that the Peak Working Set is not smaller when fetching the data in a paged manner using the following code:
public static void PreloadPaged()
{
    int NoPoints = PointRepository.Count();
    int pagesize = 50000;
    int fetchedRows = 0;
    while (fetchedRows < NoPoints)
    {
        PreloadPointEntity.Points.AddRange(PointRepository.ReadPaged(pagesize, fetchedRows));
        PointRepository.ReadPointCollections();
        PreloadPointEntity.PointCollections.Count();
        fetchedRows += pagesize;
    }
}

private static List<PointEntity> ReadPaged(int pagesize, int fetchedRows)
{
    DataModel dataContext = InstantiateDataModel();
    var Points = (from p in dataContext.PointDatas
                  select p.ToEntity());
    return Points.Skip(fetchedRows).Take(pagesize).ToList();
}
I guess it's the Linq2Sql code that is using up the memory and not reusing it or freeing it afterwards, but what can I do to get the memory footprint down?
I have observed that it uses 10 times as much memory to fetch the data as it does to store it in my list of entities. I have considered invoking the garbage collector, but I would rather avoid it.
You are retrieving way too much data and storing it in memory; that's why you are getting an OOM exception.
One of two things is occurring:
You are loading an excessive amount of data when the user will only view a subset of the results, and/or this is a first attempt at "caching" data.
You do need all this data, but are using the wrong technology (Linq2Sql) to access it.
If it's the first, you need to either
load smaller chunks of data (20-50 records, not 50K or everything), or,
if this is only for display purposes, query a projection of what's needed rather than the entity itself.
If it's the second, then use an ETL tool designed to manage large amounts of data. I prefer Rhino.ETL, but SSIS also works.
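As a rough illustration of the "smaller chunks plus projection" idea, here is a minimal sketch. The PointRow and PointSummary types and the ReadPage helper are made up for the example and are not from the question; against a real Linq2Sql table the same query shape would be expected to run as SQL on the server, but here it is only exercised against an in-memory IQueryable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical full entity, for illustration only.
public class PointRow
{
    public int Id { get; set; }
    public double X { get; set; }
    public double Y { get; set; }
    public byte[] Payload { get; set; } // a large column the display does not need
}

// Hypothetical lightweight projection holding only what the screen shows.
public class PointSummary
{
    public int Id { get; set; }
    public double X { get; set; }
    public double Y { get; set; }
}

public static class PagedProjection
{
    // Returns one page of projections. pageIndex is zero-based.
    public static List<PointSummary> ReadPage(IQueryable<PointRow> source, int pageIndex, int pageSize)
    {
        return source
            .OrderBy(p => p.Id) // a stable order, so pages do not overlap
            .Select(p => new PointSummary { Id = p.Id, X = p.X, Y = p.Y })
            .Skip(pageIndex * pageSize)
            .Take(pageSize)
            .ToList();
    }
}
```

Called as PagedProjection.ReadPage(rows, 0, 50), this materializes at most 50 small objects per call instead of 50,000 full entities.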
I am trying to load 2 huge result sets (source and target) coming from different RDBMSs, but the problem I am struggling with is getting those 2 huge result sets into memory.
Below are the queries used to pull data from the source and target:
Sql Server -
select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn
Oracle -
select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn
Records in Source : 12377200
Records in Target : 12266800
These are the approaches I have tried, with some statistics:
1) open data reader approach for reading source and target data:
Total jobs running in parallel = 3
Time taken by Job1 = 01:47:25
Time taken by Job2 = 01:47:25
Time taken by Job3 = 01:48:32
There is no index on Id Column.
Major time is spent here:
var dr = command.ExecuteReader();
Problems:
There are also timeout issues, for which I have had to set CommandTimeout to 0 (infinite), and that is bad.
2) Chunk by chunk reading approach for reading source and target data:
Total jobs = 1
Chunk size : 100000
Time Taken : 02:02:48
There is no index on Id Column.
3) Chunk by chunk reading approach for reading source and target data:
Total jobs = 1
Chunk size : 100000
Time Taken : 00:39:40
Index is present on Id column.
4) open data reader approach for reading source and target data:
Total jobs = 1
Index : Yes
Time: 00:01:43
5) open data reader approach for reading source and target data:
Total jobs running in parallel = 3
Index : Yes
Time: 00:25:12
I observed that while having an index on LinkedColumn does improve performance, the problem is we are dealing with a 3rd party RDBMS table which might not have an index.
We would like to keep database server as free as possible, so the data reader approach doesn't seem like a good idea: there will be lots of jobs running in parallel, which would put a lot of pressure on the database server, and we don't want that.
Hence we want to fetch the records from the source and target into the application's memory and do a 1-to-1 record comparison there, to keep the database server free.
Note: I want to do this in my C# application and don't want to use SSIS or a Linked Server.
Update:
Source Sql Query Execution time in sql server management studio: 00:01:41
Target Sql Query Execution time in sql server management studio:00:01:40
What is the best way to read a huge result set into memory?
Code:
static void Main(string[] args)
{
    // Running 3 jobs in parallel
    //Task<string>[] taskArray = { Task<string>.Factory.StartNew(() => Compare()),
    //Task<string>.Factory.StartNew(() => Compare()),
    //Task<string>.Factory.StartNew(() => Compare())
    //};
    Compare(); // Run single job
    Console.ReadKey();
}

public static string Compare()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();
    var srcConnection = new SqlConnection("Source Connection String");
    srcConnection.Open();
    var command1 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn", srcConnection);
    var tgtConnection = new SqlConnection("Target Connection String");
    tgtConnection.Open();
    var command2 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn", tgtConnection);
    var drA = GetReader(command1);
    var drB = GetReader(command2);
    stopwatch.Stop();
    string a = stopwatch.Elapsed.ToString(@"d\.hh\:mm\:ss");
    Console.WriteLine(a);
    return a;
}

private static IDataReader GetReader(SqlCommand command)
{
    command.CommandTimeout = 0;
    return command.ExecuteReader(); // Culprit
}
There is nothing (that I know of) faster than a DataReader for fetching DB records.
Working with large databases comes with its challenges; reading 10 million records in under 2 minutes is pretty good.
If you want it faster, you can:
jdwend's suggestion:
Use sqlcmd.exe and the Process class to run the query and put the results into a CSV file, and then read the CSV into C#. sqlcmd.exe is designed to archive large databases and runs 100x faster than the C# interface. Using LINQ methods is also faster than the SQL Client class.
Parallelize your queries and fetch concurrently, merging the results: https://shahanayyub.wordpress.com/2014/03/30/how-to-load-large-dataset-in-datagridview/
The easiest (and IMO the best for a SELECT * of everything) is to throw hardware at it:
https://blog.codinghorror.com/hardware-is-cheap-programmers-are-expensive/
Also make sure you're testing on the PROD hardware, in release mode, as that could skew your benchmarks.
This is a pattern that I use. It gets the data for a particular record set into a System.Data.DataTable instance and then closes and disposes all unmanaged resources ASAP. The pattern also works for other providers under System.Data, including System.Data.OleDb, System.Data.SqlClient, etc. I believe the Oracle Client SDK implements the same pattern.
// don't forget these using directives
using System.Data;
using System.Data.SqlClient;

// here's the code.
var connectionstring = "YOUR_CONN_STRING";
var table = new DataTable("MyData");
using (var cn = new SqlConnection(connectionstring))
{
    cn.Open();
    using (var cmd = cn.CreateCommand())
    {
        cmd.CommandText = "Select [Fields] From [Table] etc etc"; // your SQL statement here.
        using (var adapter = new SqlDataAdapter(cmd))
        {
            adapter.Fill(table);
        } // dispose adapter
    } // dispose cmd
    cn.Close();
} // dispose cn

foreach (DataRow row in table.Rows)
{
    // do something with the data set.
}
I think I would deal with this problem in a different way.
But first, let's make some assumptions:
According to your question description, you will get data from SQL Server and Oracle.
Each query will return a large amount of data.
You do not specify the point of getting all that data into memory, nor what it will be used for.
I assume that the data you process is going to be used multiple times, and that you will not repeat both queries multiple times.
And whatever you do with the data, it is probably not going to be displayed to the user all at the same time.
With these points as a foundation, I would proceed as follows:
Think of this problem as data processing.
Have a third database, or some other place with auxiliary database tables, where you can store the results of the 2 queries.
To avoid timeouts, try to obtain the data using paging (get thousands of rows at a time) and save it in these aux DB tables, NOT in RAM.
As soon as your logic completes all the data loading (import migration), you can start processing it.
Data processing is a key strength of database engines; they are efficient and have evolved over many years, so don't spend time reinventing the wheel. Use a stored procedure to "crunch/process/merge" the 2 auxiliary tables into only 1.
Now that you have all the "merged" data in a 3rd aux table, you can use it for display or whatever else you need.
If you want to read it faster, you must use the original API to get the data. Avoid frameworks like LINQ and rely on a DataReader instead. Check whether you need something like a dirty read (WITH (NOLOCK) in SQL Server).
If your data is very large, try to implement partial reads, something like indexing your data. Maybe you can add a condition on a date range (from - to) and repeat until everything has been selected.
After that, consider using threading in your system to parallelize the flow: one thread to fetch for job 1, another thread to fetch for job 2. This will cut a lot of time.
Technicalities aside, I think there is a more fundamental problem here.
select [...] order by LinkedColumn
I observed that while having an index on LinkedColumn does improve performance, the problem is we are dealing with a 3rd party RDBMS table which might not have an index.
We would like to keep database server as free as possible
If you cannot ensure that the DB has a tree based index on that column, it means the DB will be quite busy sorting your millions of elements. It's slow and resource hungry. Get rid of the order by in the SQL statement and perform it on application side to get results faster and reduce load on DB ...or ensure the DB has such an index!!!
...depending if this fetching is a common or a rare operation, you'll want to either enforce a proper index in the DB, or just fetch it all and sort it client side.
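To make the "sort it client side" option concrete, here is a minimal sketch, assuming both result sets have already been fetched without ORDER BY into lists of (LinkedColumn, CompareColumn) pairs. The SortedCompare class and its method names are made up for the example. Each list is sorted by key in the application, and then a single merge-style pass reports keys missing on either side and keys whose values differ.

```csharp
using System;
using System.Collections.Generic;

public static class SortedCompare
{
    // Sorts rows in place by key; this replaces ORDER BY on the database side.
    public static void SortByKey(List<KeyValuePair<long, string>> rows)
    {
        rows.Sort((a, b) => a.Key.CompareTo(b.Key));
    }

    // Walks two sequences that are both sorted ascending by key (keys assumed unique)
    // and collects a description of every difference found.
    public static List<string> Diff(
        IEnumerable<KeyValuePair<long, string>> source,
        IEnumerable<KeyValuePair<long, string>> target)
    {
        var differences = new List<string>();
        using (var s = source.GetEnumerator())
        using (var t = target.GetEnumerator())
        {
            bool hasS = s.MoveNext();
            bool hasT = t.MoveNext();
            while (hasS || hasT)
            {
                if (!hasT || (hasS && s.Current.Key < t.Current.Key))
                {
                    // Key exists only on the source side.
                    differences.Add("missing in target: " + s.Current.Key);
                    hasS = s.MoveNext();
                }
                else if (!hasS || t.Current.Key < s.Current.Key)
                {
                    // Key exists only on the target side.
                    differences.Add("missing in source: " + t.Current.Key);
                    hasT = t.MoveNext();
                }
                else
                {
                    // Same key on both sides: compare the values.
                    if (!string.Equals(s.Current.Value, t.Current.Value))
                        differences.Add("value differs: " + s.Current.Key);
                    hasS = s.MoveNext();
                    hasT = t.MoveNext();
                }
            }
        }
        return differences;
    }
}
```

Sorting client side means both lists must fit in memory. If the database can return the rows already ordered cheaply (that is, an index exists), Diff can be fed directly from two iterators wrapping the data readers instead, and neither side has to be held in memory as a whole.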
I had a similar situation many years ago. Before I looked at the problem it took 5 days running continuously to move data between 2 systems using SQL.
I took a different approach.
We extracted the data from the source system into just a small number of files representing a flattened out data model and arranged the data in each file so it all naturally flowed in the proper sequence as we read from the files.
I then wrote a Java program that processed these flattened data files and produced individual table load files for the target system. So, for example, the source extract had less than a dozen data files from the source system which turned into 30 to 40 or so load files for the target database.
That process would run in just a few minutes and I incorporated full auditing and error reporting and we could quickly spot problems and discrepancies in the source data, get them fixed, and run the processor again.
The final piece of the puzzle was a multi-threaded utility I wrote that performed a parallel bulk load on each load file into the target Oracle database. This utility created a Java process for each table and used Oracle's bulk table load program to quickly push the data into the Oracle DB.
When all was said and done that 5 day SQL-SQL transfer of millions of records turned into just 30 minutes using a combination of Java and Oracle's bulk load capabilities. And there were no errors and we accounted for every penny of every account that was transferred between systems.
So, maybe think outside the SQL box and use Java, the file system, and Oracle's bulk loader. And make sure you're doing your file IO on solid state hard drives.
If you need to process large database result sets from Java, you can opt for JDBC to give you the low level control required. On the other hand, if you are already using an ORM in your application, falling back to JDBC might imply some extra pain. You would be losing features such as optimistic locking, caching, automatic fetching when navigating the domain model and so forth. Fortunately most ORMs, like Hibernate, have some options to help you with that. While these techniques are not new, there are a couple of possibilities to choose from.
A simplified example; let's assume we have a table (mapped to class "DemoEntity") with 100.000 records. Each record consists of a single column (mapped to the property "property" in DemoEntity) holding some random alphanumerical data of about ~2KB. The JVM is run with -Xmx250m. Let's assume that 250MB is the overall maximum memory that can be assigned to the JVM on our system. Your job is to read all records currently in the table, doing some not further specified processing, and finally store the result. We'll assume that the entities resulting from our bulk operation are not modified
I am storing images in SQL Server as byte[] and then retrieving them using ViewData as follows (there are nine images (byte[]) in the database which I am retrieving):
Action controller:
public ActionResult show_pics2()
{
    using (cygnussolutionEntities6 db = new cygnussolutionEntities6())
    {
        // db.CommandTimeout = int.MaxValue; //For test
        var querylist = (from f in db.Images
                         select f.ImageContent);
        // get list in ViewBag
        ViewBag.DataLIst = querylist;
        // get list in View Data
        ViewData["images"] = querylist.ToList();
        return View();
    }
}
In the view, I am parsing the images and displaying them with a foreach loop over ViewData, but it takes very long to load in the browser. Does anyone know why this is taking so long?
It would be worth putting some timers through your code to see where the slowness is occurring - i.e. is it the query to the database that is slow, is materialising the result into a list slow, or is the slowness occurring within the view when processing the images.
Also worth considering - are you eager-loading any properties from the Images table as that will affect performance depending on the number of extra navigation properties being loaded.
Are you able to post the code within your view and your entities class?
If you are going to ToList() something and also use it elsewhere, then make both uses work from the result of the ToList() (unless you really need to maintain the queryable interface). Then you need only get that list once.
Don't obtain several blobs and use them on a page; obtain several IDs and use them to call other resources through <img src="theImage/@id"> or similar, then have that resource served by a view that only retrieves the sole image. Then each such resource loads a single image, and they can load in parallel with each other.
Unless the images are all small, use streams that build on ADO access to the blob and stream out a chunk of 4096 bytes at a time, rather than EF. While it means leaving behind a lot of what EF gives you (and a lot of what MVC gives you, since you'll have to do it in a result rather than a view), it allows for memory-efficient streaming that can begin with the first loaded chunk.
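The chunked streaming idea can be sketched roughly as below. This is only the copy loop, exercised here with in-memory streams; the ChunkedCopy name is made up for the example. In a real application the source would presumably be a stream over the blob column (for instance from SqlDataReader.GetStream on a reader opened with CommandBehavior.SequentialAccess) and the destination would be the response's output stream, but that wiring is an assumption and not shown.

```csharp
using System;
using System.IO;

public static class ChunkedCopy
{
    // Copies source to destination one buffer at a time and returns the number
    // of bytes written. Only bufferSize bytes are held in memory at any moment.
    public static long Copy(Stream source, Stream destination, int bufferSize = 4096)
    {
        var buffer = new byte[bufferSize];
        long total = 0;
        int read;
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            destination.Write(buffer, 0, read);
            total += read;
        }
        return total;
    }
}
```

The point of the loop is that the first chunk can go out to the browser before the rest of the image has been read, and memory use stays flat regardless of image size.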
I have an Excel file that originally had about 200 rows, and I was able to convert the Excel file to a data table and everything got inserted into DocumentDB correctly.
The Excel file now has 5000 rows, and insertion stops after 30-40 records; the rest of the rows are not inserted into DocumentDB.
I found the following exception:
Microsoft.Azure.Documents.DocumentClientException: Exception:
Microsoft.Azure.Documents.RequestRateTooLargeException, message:
{"Errors":["Request rate is large"]}
My code is:
Service service = new Service();
foreach (var data in exceldata) // exceldata contains a set of rows
{
    var student = new Student();
    student.id = "";
    student.name = data.name;
    student.age = data.age;
    student.@class = data.@class; // "class" is a C# keyword, so it needs the @ prefix
    student.id = service.savetoDocumentDB(collectionLink, student); // collectionLink is a string stored in web.config
    students.add(student);
}

class Service
{
    public async Task<string> AddDocument(string collectionLink, Student data)
    {
        this.DeserializePayload(data);
        var result = await Client.CreateDocumentAsync(collectionLink, data);
        return result.Resource.Id;
    }
}
Am I doing anything wrong?
Any help would be greatly appreciated.
Update:
As of 4/8/15, DocumentDB has released a data import tool, which supports JSON files, MongoDB, SQL Server, and CSV files. You can find it here: http://www.microsoft.com/en-us/download/details.aspx?id=46436
In this case, you can save your Excel file as a CSV and then bulk-import records using the data import tool.
Original Answer:
DocumentDB Collections are provisioned 2,000 request-units per second. It's important to note - the limits are expressed in terms of request-units and not requests; so writing larger documents costs more than smaller documents, and scanning is more expensive than index seeks.
You can measure the overhead of any operations (CRUD) by inspecting the x-ms-request-charge HTTP response header or the RequestCharge property in the ResourceResponse/FeedResponse objects returned by the SDK.
A RequestRateTooLargeException is thrown when you exhaust the provisioned throughput. Some solutions include:
Back off with a short delay and retry whenever you encounter the exception. A recommended retry delay is included in the x-ms-retry-after-ms HTTP response header. Alternatively, you could simply batch requests with a short delay.
Use lazy indexing for faster ingestion rate. DocumentDB allows you to specify indexing policies at the collection level. By default, the index is updated synchronously on each write to the collection. This enables the queries to honor the same consistency level as that of the document reads without any delay for the index to “catch up”. Lazy indexing can be used to amortize the work required to index content over a longer period of time. It is important to note, however, that when lazy indexing is enabled, query results will be eventually consistent regardless of the consistency level configured for the DocumentDB account.
As mentioned, each collection has a limit of 2,000 RUs - you can increase throughput by sharding / partitioning your data across multiple collections and capacity units.
Delete empty collections to utilize all provisioned throughput - every document collection created in a DocumentDB account is allocated reserved throughput capacity based on the number of Capacity Units (CUs) provisioned, and the number of collections created. A single CU makes available 2,000 request units (RUs) and supports up to 3 collections. If only one collection is created for the CU, the entire CU throughput will be available for the collection. Once a second collection is created, the throughput of the first collection will be halved and given to the second collection, and so on. To maximize throughput available per collection, I'd recommend the number of capacity units to collections is 1:1.
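The "back off and retry" option from the list above could look roughly like this. It is a sketch under stated assumptions: ThrottledException is a made-up stand-in for the SDK's throttling error and carries the recommended delay (the value of x-ms-retry-after-ms), and ExecuteWithRetryAsync is a hypothetical helper, not part of the DocumentDB SDK.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the throttling error; RetryAfter plays the role
// of the delay recommended in the x-ms-retry-after-ms response header.
public class ThrottledException : Exception
{
    public TimeSpan RetryAfter { get; private set; }

    public ThrottledException(TimeSpan retryAfter) : base("Request rate is large")
    {
        RetryAfter = retryAfter;
    }
}

public static class Throttling
{
    // Runs the operation, waiting for the recommended delay and retrying
    // whenever it is throttled, up to maxAttempts attempts in total.
    public static async Task<T> ExecuteWithRetryAsync<T>(Func<Task<T>> operation, int maxAttempts = 10)
    {
        for (int attempt = 1; ; attempt++)
        {
            TimeSpan delay;
            try
            {
                return await operation();
            }
            catch (ThrottledException ex)
            {
                if (attempt >= maxAttempts) throw; // give up and surface the error
                delay = ex.RetryAfter;
            }
            await Task.Delay(delay); // back off before the next attempt
        }
    }
}
```

In the question's loop, each insert would then be awaited through this helper one at a time, so a throttled write pauses the loop instead of silently dropping the remaining rows.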
References:
DocumentDB Performance Tips:
http://azure.microsoft.com/blog/2015/01/27/performance-tips-for-azure-documentdb-part-2/
DocumentDB Limits:
http://azure.microsoft.com/en-us/documentation/articles/documentdb-limits/
I'm working on a project where we're receiving data from multiple sources, that needs to be saved into various tables in our database.
Fast.
I've played with various methods, and the fastest I've found so far is using a collection of table-valued parameters, filling them up and periodically sending them to the database via a corresponding collection of stored procedures.
The results are quite satisfying. However, looking at disk usage (% Idle Time in Perfmon), I can see that the disk is getting periodically 'thrashed' (a 'spike' down to 0% every 13-18 seconds), whilst in between the %Idle time is around 90%. I've tried varying the 'batch' size, but it doesn't have an enormous influence.
Should I be able to get better throughput by (somehow) avoiding the spikes while decreasing the overall idle time?
What are some things I should be looking at to work out where the spiking is happening? (The database is in Simple recovery mode, and pre-sized to 'big', so it's not the log file growing.)
Bonus: I've seen other questions referring to 'streaming' data into the database, but this seems to involve having a Stream from another database (last section here). Is there any way I could shoe-horn 'pushed' data into that?
A very easy way of inserting loads of data into SQL Server is, as mentioned, the 'bulk insert' method. ADO.NET offers a very easy way of doing this without the need for external files. Here's the code:
var bulkCopy = new SqlBulkCopy(myConnection);
bulkCopy.DestinationTableName = "MyTable";
bulkCopy.WriteToServer(myDataTable); // WriteToServer takes a DataTable, DataRow[] or IDataReader
That's easy.
But: myDataTable needs to have exactly the same structure as MyTable, i.e. names, field types and order of fields must be exactly the same. If not, well, there's a solution to that. It's column mapping. And this is even easier to do:
bulkCopy.ColumnMappings.Add("ColumnNameOfDataTable", "ColumnNameOfTable");
That's still easy.
But: myDataTable needs to fit into memory. If not, things become a bit more tricky, as we need an IDataReader implementation that can be instantiated from an IEnumerable.
You might get all the information you need in this article.
Building on the code referred to in alzaimar's answer, I've got a proof of concept working with IObservable (just to see if I can). It seems to work ok. I just need to put together some tidier code to see if this is actually any faster than what I already have.
(The following code only really makes sense in the context of the test program in code download in the aforementioned article.)
Warning: NSFW, copy/paste at your peril!
private static void InsertDataUsingObservableBulkCopy(IEnumerable<Person> people,
                                                      SqlConnection connection)
{
    var sub = new Subject<Person>();
    var bulkCopy = new SqlBulkCopy(connection);
    bulkCopy.DestinationTableName = "Person";
    bulkCopy.ColumnMappings.Add("Name", "Name");
    bulkCopy.ColumnMappings.Add("DateOfBirth", "DateOfBirth");
    using (var dataReader = new ObjectDataReader<Person>(people))
    {
        var task = Task.Factory.StartNew(() =>
        {
            bulkCopy.WriteToServer(dataReader);
        });
        var stopwatch = Stopwatch.StartNew();
        foreach (var person in people) sub.OnNext(person);
        sub.OnCompleted();
        task.Wait();
        Console.WriteLine("Observable Bulk copy: {0}ms",
            stopwatch.ElapsedMilliseconds);
    }
}
It's difficult to comment without knowing the specifics, but one of the fastest ways to get data into SQL Server is Bulk Insert from a file.
You could write the incoming data to a temp file and periodically bulk insert it.
Streaming data into a SQL Server table-valued parameter also looks like a good solution for fast inserts, as the rows are held in memory. In answer to your question: yes, you could use this; you just need to turn your data into an IDataReader. There are various ways to do this, from a DataTable for example, see here.
If your disk is a bottleneck you could always optimise your infrastructure. Put database on a RAM disk or SSD for example.
I am working with a very large data set, roughly 2 million records. I have the code below but get an out of memory exception after it has processed around three batches, about 600,000 records. I understand that as it loops through each batch Entity Framework lazy loads, which then tries to build up the full 2 million records in memory. Is there any way to unload a batch once I've processed it?
ModelContext dbContext = new ModelContext();
IEnumerable<IEnumerable<Town>> towns = dbContext.Towns.OrderBy(t => t.TownID).Batch(200000);
foreach (var batch in towns)
{
    SearchClient.Instance.IndexMany(batch, SearchClient.Instance.Settings.DefaultIndex, "Town", new SimpleBulkParameters() { Refresh = false });
}
Note: The Batch method comes from this project: https://code.google.com/p/morelinq/
The search client is this: https://github.com/Mpdreamz/NEST
The issue is that when you get data from EF there are actually two copies of the data created: one which is returned to the user, and a second which EF holds onto and uses for change detection (so that it can persist changes to the database). EF holds this second set for the lifetime of the context, and it's this set that's running you out of memory.
You have 2 options to deal with this:
Renew your context each batch.
Use .AsNoTracking() in your query, e.g.:
IEnumerable<IEnumerable<Town>> towns = dbContext.Towns.AsNoTracking().OrderBy(t => t.TownID).Batch(200000);
This tells EF not to keep a copy for change detection. You can read a little more about what AsNoTracking does and the performance impacts of this on my blog: http://blog.staticvoid.co.nz/2012/4/2/entity_framework_and_asnotracking
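For the first option (renew your context each batch), here is a minimal sketch of one way the loop could be structured. The BatchedProcessing helper and its parameter names are made up for illustration, and it swaps the Batch() call for paging by key so that each batch is read through a fresh, short-lived context; it assumes the key is an int and strictly increasing in the sort order.

```csharp
using System;
using System.Collections.Generic;

public static class BatchedProcessing
{
    // contextFactory: creates a new context for each batch (hypothetical,
    //                 e.g. () => new ModelContext()).
    // readBatch:      given a context, the last key already processed and a
    //                 batch size, returns the next rows ordered by key.
    // keyOf:          extracts the ordering key from a row.
    // process:        does the work for one batch (e.g. sends it to the index).
    // Returns the total number of rows processed.
    public static int ProcessInBatches<TContext, TRow>(
        Func<TContext> contextFactory,
        Func<TContext, int, int, List<TRow>> readBatch,
        Func<TRow, int> keyOf,
        Action<List<TRow>> process,
        int batchSize)
        where TContext : IDisposable
    {
        int lastKey = int.MinValue;
        int total = 0;
        while (true)
        {
            List<TRow> batch;
            using (var context = contextFactory())
            {
                batch = readBatch(context, lastKey, batchSize);
            } // the context is disposed here, so whatever it tracked can be collected

            if (batch.Count == 0) break;

            process(batch);
            total += batch.Count;
            lastKey = keyOf(batch[batch.Count - 1]);
        }
        return total;
    }
}
```

With EF, readBatch might be something like (ctx, last, n) => ctx.Towns.Where(t => t.TownID > last).OrderBy(t => t.TownID).Take(n).ToList(), and keyOf would be t => t.TownID. Both are assumptions about the model in the question.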
I wrote a migration routine that reads from one DB and writes (with minor changes in layout) into another DB (of a different type) and in this case, renewing the connection for each batch and using AsNoTracking() did not cut it for me.
Note that this problem occurs using a '97 version of JET. It may work flawlessly with other DBs.
However, the following algorithm did solve the Out-of-memory issue:
use one connection for reading and one for writing/updating
Read with AsNoTracking()
every 50 rows or so written/updated, check the memory usage, recover memory + reset output DB context (and connected tables) as needed:
var before = System.Diagnostics.Process.GetCurrentProcess().VirtualMemorySize64;
if (before > 800000000)
{
    dbcontextOut.SaveChanges();
    dbcontextOut.Dispose();
    GC.Collect();
    GC.WaitForPendingFinalizers();
    dbcontextOut = dbcontextOutFunc();
    tableOut = Dynamic.InvokeGet(dbcontextOut, outputTableName);
}