C# OutOfMemory, Memory Mapped File or Temp Database

Seeking some advice, best practice etc...
Technology: C# .NET4.0, Winforms, 32 bit
I am seeking some advice on how I can best tackle large data processing in my C# Winforms application which experiences high memory usage (working set) and the occasional OutOfMemory exception.
The problem is that we perform a large amount of data processing "in-memory" when a "shopping-basket" is opened. In simplistic terms, when a "shopping-basket" is loaded we perform the following calculations:
For each item in the "shopping-basket", retrieve its historical price going all the way back to the date the item first appeared in stock (this could be two months, two years or two decades of data). Historical price data is retrieved from text files, over the internet, or any other format supported by a price plugin.
For each item, for each day since it first appeared in stock, calculate various metrics which build a historical profile for that item in the shopping-basket.
The result is that we can potentially perform hundreds, thousands and/or millions of calculations depending upon the number of items in the "shopping-basket". If the basket contains too many items we run the risk of hitting an "OutOfMemory" exception.
A couple of caveats:
This data needs to be calculated for each item in the "shopping-basket" and the data is kept until the "shopping-basket" is closed.
Even though we perform steps 1 and 2 in a background thread, speed is important as the number of items in the "shopping-basket" can greatly affect overall calculation speed.
Memory is reclaimed by the .NET garbage collector when a "shopping-basket" is closed. We have profiled our application and ensured that all references are correctly disposed of and closed when a basket is closed.
After all the calculations are completed the resultant data is stored in an IDictionary. "CalculatedData" is a class whose properties are the individual metrics calculated by the above process.
Some ideas I've thought about;
Obviously my main concern is to reduce the amount of memory being used by the calculations; however, the volume of memory used can only be reduced if I:
1) reduce the number of metrics being calculated for each day or
2) reduce the number of days used for the calculation.
Both of these options are not viable if we wish to fulfill our business requirements.
Memory Mapped Files
One idea has been to use memory mapped files to store the data dictionary. Would this be possible/feasible, and how can we put it into place?
Use a temporary database
The idea is to use a separate (not in-memory) database which can be created for the life-cycle of the application. As "shopping-baskets" are opened we can persist the calculated data to the database for repeated use, alleviating the requirement to recalculate for the same "shopping-basket".
Are there any other alternatives that we should consider? What is best practice when it comes to calculations on large data and performing them outside of RAM?
Any advice is appreciated....

The easiest solution is a database, perhaps SQLite. Memory mapped files don't automatically become dictionaries; you would have to code all the memory management yourself, and thereby fight with the .NET GC system itself for ownership of the data.
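To make that concrete, here is a minimal sketch of what the raw .NET 4.0 memory-mapped file API actually gives you: a flat region of bytes addressed by offset, with serialization and layout left entirely to you. The map name, capacity and record layout below are illustrative assumptions, not a recommended design.
using System;
using System.IO.MemoryMappedFiles;
using System.Text;

class MemMapSketch
{
    static void Main()
    {
        // Illustrative only: a 10 MB map with a made-up name.
        using (var mmf = MemoryMappedFile.CreateNew("BasketCache", 10 * 1024 * 1024))
        using (var accessor = mmf.CreateViewAccessor())
        {
            // You work with raw bytes at explicit offsets - there is no
            // built-in notion of keys, values or a dictionary.
            byte[] payload = Encoding.UTF8.GetBytes("metric:123.45");
            accessor.Write(0, payload.Length);                       // length prefix
            accessor.WriteArray(sizeof(int), payload, 0, payload.Length);

            // Reading it back means reversing the same manual layout.
            int length = accessor.ReadInt32(0);
            byte[] buffer = new byte[length];
            accessor.ReadArray(sizeof(int), buffer, 0, length);
            Console.WriteLine(Encoding.UTF8.GetString(buffer));
        }
    }
}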

If you're interested in trying the memory mapped file approach, you can try it now. I wrote a small native .NET package called MemMapCache that in essence creates a key/value database backed by memory-mapped files. It's a bit of a hacky concept, but the MemMapCache.exe process keeps all of the references to the memory mapped files, so that if your application crashes you don't have to worry about losing the state of your cache.
It's very simple to use and you should be able to drop it in your code without too many modifications. Here is an example using it: https://github.com/jprichardson/MemMapCache/blob/master/TestMemMapCache/MemMapCacheTest.cs
Maybe it'd be of some use to you to at least further figure out what you need to do for an actual solution.
Please let me know if you do end up using it. I'd be interested in your results.
However, long-term, I'd recommend Redis.

As an update for those stumbling upon this thread...
We ended up using SQLite as our caching solution. The SQLite database we employ exists separately from the main data store used by the application. We persist calculated data to the SQLite database (diskCache) as it's required and have code controlling cache invalidation etc. This was a suitable solution for us as we were able to achieve write speeds of around 100,000 records per second.
For those interested, this is the code that controls inserts into the diskCache. Full credit for this code goes to JP Richardson (shown answering a question here) for his excellent blog post.
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SQLite;
using System.Text;

internal class SQLiteBulkInsert
{
    #region Class Declarations
    private SQLiteCommand m_cmd;
    private SQLiteTransaction m_trans;
    private readonly SQLiteConnection m_dbCon;
    private readonly Dictionary<string, SQLiteParameter> m_parameters = new Dictionary<string, SQLiteParameter>();
    private uint m_counter;
    private readonly string m_beginInsertText;
    #endregion

    #region Constructor
    public SQLiteBulkInsert(SQLiteConnection dbConnection, string tableName)
    {
        m_dbCon = dbConnection;
        m_tableName = tableName;

        var query = new StringBuilder(255);
        query.Append("INSERT INTO ["); query.Append(tableName); query.Append("] (");
        m_beginInsertText = query.ToString();
    }
    #endregion

    #region Allow Bulk Insert
    private bool m_allowBulkInsert = true;
    public bool AllowBulkInsert { get { return m_allowBulkInsert; } set { m_allowBulkInsert = value; } }
    #endregion

    #region CommandText
    public string CommandText
    {
        get
        {
            if (m_parameters.Count < 1) throw new SQLiteException("You must add at least one parameter.");

            var sb = new StringBuilder(255);
            sb.Append(m_beginInsertText);

            foreach (var param in m_parameters.Keys)
            {
                sb.Append('[');
                sb.Append(param);
                sb.Append(']');
                sb.Append(", ");
            }
            sb.Remove(sb.Length - 2, 2);

            sb.Append(") VALUES (");

            foreach (var param in m_parameters.Keys)
            {
                sb.Append(m_paramDelim);
                sb.Append(param);
                sb.Append(", ");
            }
            sb.Remove(sb.Length - 2, 2);

            sb.Append(")");
            return sb.ToString();
        }
    }
    #endregion

    #region Commit Max
    private uint m_commitMax = 25000;
    public uint CommitMax { get { return m_commitMax; } set { m_commitMax = value; } }
    #endregion

    #region Table Name
    private readonly string m_tableName;
    public string TableName { get { return m_tableName; } }
    #endregion

    #region Parameter Delimiter
    private const string m_paramDelim = ":";
    public string ParamDelimiter { get { return m_paramDelim; } }
    #endregion

    #region AddParameter
    public void AddParameter(string name, DbType dbType)
    {
        var param = new SQLiteParameter(m_paramDelim + name, dbType);
        m_parameters.Add(name, param);
    }
    #endregion

    #region Flush
    public void Flush()
    {
        try
        {
            if (m_trans != null) m_trans.Commit();
        }
        catch (Exception ex)
        {
            throw new Exception("Could not commit transaction. See InnerException for more details", ex);
        }
        finally
        {
            if (m_trans != null) m_trans.Dispose();
            m_trans = null;
            m_counter = 0;
        }
    }
    #endregion

    #region Insert
    public void Insert(object[] paramValues)
    {
        if (paramValues.Length != m_parameters.Count)
            throw new Exception("The values array count must be equal to the count of the number of parameters.");

        m_counter++;

        if (m_counter == 1)
        {
            if (m_allowBulkInsert) m_trans = m_dbCon.BeginTransaction();
            m_cmd = m_dbCon.CreateCommand();
            foreach (var par in m_parameters.Values)
                m_cmd.Parameters.Add(par);
            m_cmd.CommandText = CommandText;
        }

        var i = 0;
        foreach (var par in m_parameters.Values)
        {
            par.Value = paramValues[i];
            i++;
        }

        m_cmd.ExecuteNonQuery();

        if (m_counter != m_commitMax)
        {
            // Do nothing
        }
        else
        {
            try
            {
                if (m_trans != null) m_trans.Commit();
            }
            catch (Exception)
            { }
            finally
            {
                if (m_trans != null)
                {
                    m_trans.Dispose();
                    m_trans = null;
                }
                m_counter = 0;
            }
        }
    }
    #endregion
}
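For anyone wanting to see how the class above is driven, here is a minimal usage sketch. The connection string, table and column names, and the calculatedRows sequence are made-up placeholders for this example, not our actual diskCache schema.
// Assumes a table like: CREATE TABLE CalculatedData (ItemId TEXT, MetricDate TEXT, MetricValue REAL)
using (var connection = new SQLiteConnection("Data Source=diskCache.db"))
{
    connection.Open();

    var bulkInsert = new SQLiteBulkInsert(connection, "CalculatedData");
    bulkInsert.AddParameter("ItemId", DbType.String);
    bulkInsert.AddParameter("MetricDate", DbType.String);
    bulkInsert.AddParameter("MetricValue", DbType.Double);

    // calculatedRows stands in for whatever produces your metric values.
    foreach (var row in calculatedRows)
        bulkInsert.Insert(new object[] { row.ItemId, row.MetricDate, row.MetricValue });

    // Commit whatever is left in the final, partially filled batch.
    bulkInsert.Flush();
}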

Related

Creating a thread-safe version of a c# statistics service

I have an API that people are calling and I have a database containing statistics of the number of requests. All API requests are made by a user in a company. There's a row in the database per user per company per hour. Example:
| CompanyId | UserId | Date             | Requests |
|-----------|--------|------------------|----------|
| 1         | 100    | 2020-01-30 14:00 | 4527     |
| 1         | 100    | 2020-01-30 15:00 | 43       |
| 2         | 201    | 2020-01-30 14:00 | 161      |
To avoid having to make a database call on every request, I've developed a service class in C# maintaining an in-memory representation of the statistics stored in a database:
public class StatisticsService
{
    private readonly IDatabase database;
    private readonly Dictionary<string, CompanyStats> statsByCompany;
    private DateTime lastTick = DateTime.MinValue;

    public StatisticsService(IDatabase database)
    {
        this.database = database;
        this.statsByCompany = new Dictionary<string, CompanyStats>();
    }

    private class CompanyStats
    {
        public CompanyStats(List<UserStats> userStats)
        {
            UserStats = userStats;
        }

        public List<UserStats> UserStats { get; set; }
    }

    private class UserStats
    {
        public UserStats(string userId, int requests, DateTime hour)
        {
            UserId = userId;
            Requests = requests;
            Hour = hour;
            Updated = DateTime.MinValue;
        }

        public string UserId { get; set; }
        public int Requests { get; set; }
        public DateTime Hour { get; set; }
        public DateTime Updated { get; set; }
    }
}
Every time someone calls the API, I'm calling an increment method on the StatisticsService:
public void Increment(string companyId, string userId)
{
    var utcNow = DateTime.UtcNow;
    EnsureCompanyLoaded(companyId, utcNow);
    var currentHour = new DateTime(utcNow.Year, utcNow.Month, utcNow.Day, utcNow.Hour, 0, 0);
    var stats = statsByCompany[companyId];
    var userStats = stats.UserStats.FirstOrDefault(ls => ls.UserId == userId && ls.Hour == currentHour);
    if (userStats == null)
    {
        var userStatsToAdd = new UserStats(userId, 1, currentHour);
        userStatsToAdd.Updated = utcNow;
        stats.UserStats.Add(userStatsToAdd);
    }
    else
    {
        userStats.Requests++;
        userStats.Updated = utcNow;
    }
}
The method loads the company into the cache if it isn't already there (I will show EnsureCompanyLoaded in a bit). It then checks if there is a UserStats object for this hour for the user and company. If not, it creates one and sets Requests to 1. If other requests have already been made for this user, company, and current hour, it increments the number of requests by 1.
EnsureCompanyLoaded as promised:
private void EnsureCompanyLoaded(string companyId, DateTime utcNow)
{
    if (statsByCompany.ContainsKey(companyId)) return;
    var currentHour = new DateTime(utcNow.Year, utcNow.Month, utcNow.Day, utcNow.Hour, 0, 0);
    var userStats = new List<UserStats>();
    userStats.AddRange(database.GetAllFromThisMonth(companyId));
    statsByCompany[companyId] = new CompanyStats(userStats);
}
The details behind loading the data from the database are hidden away behind the GetAllFromThisMonth method and not important to my question.
Finally, I have a timer that stores any updated results to the database every 5 minutes or when the process shuts down:
public void Tick(object state)
{
    var utcNow = DateTime.UtcNow;
    var currentHour = new DateTime(utcNow.Year, utcNow.Month, utcNow.Day, utcNow.Hour, 0, 0);
    foreach (var companyId in statsByCompany.Keys)
    {
        var usersToUpdate = statsByCompany[companyId].UserStats.Where(ls => ls.Updated > lastTick);
        foreach (var userStats in usersToUpdate)
        {
            database.Save(GenerateSomeEntity(userStats.Requests));
            userStats.Updated = DateTime.MinValue;
        }
    }

    // If we moved into new month since last tick, clear entire cache
    if (lastTick.Month != utcNow.Month)
    {
        statsByCompany.Clear();
    }

    lastTick = utcNow;
}
I've done some single-threaded testing of the code and the concept seems to work out as expected. Now I want to migrate this to be thread-safe but cannot seem to figure out the best way to implement it. I've looked at ConcurrentDictionary, which might be needed. The main problem isn't the dictionary methods themselves, though. If two threads call Increment simultaneously, they could both end up in the EnsureCompanyLoaded method. I know of the concept of lock in C#, but I'm afraid to just lock on every invocation and slow down performance that way.
Anyone needed something similar and have some good pointers in which direction I could go?
When keeping counters in memory like this you have two options:
Keep in memory the actual historic value of the counter
Keep in memory only the differential increment of the counter
I have used both approaches and I've found the second to be simpler, faster and safer. So my suggestion is to stop loading UserStats from the database, and just increment the in-memory counter starting from 0. Then every 5 minutes call a stored procedure that inserts or updates the related database record accordingly (while zero-ing the in-memory value). This way you'll eliminate the race conditions at the loading phase, and you'll ensure that every call to Increment will be consistently fast.
For thread-safety you can use either a normal Dictionary with a lock, or a ConcurrentDictionary without a lock. The first option is more flexible, and the second more efficient. If you choose Dictionary+lock, use the lock only for protecting the internal state of the Dictionary. Don't lock while updating the database. Before updating each counter, take the current value from the dictionary and remove the entry in one atomic operation, then issue the database command; other threads will be able to recreate the entry again if needed. The ConcurrentDictionary class contains a TryRemove method that can be used to achieve this goal without locking:
public bool TryRemove (TKey key, out TValue value);
It also contains a ToArray method that returns a snapshot of the entries in the dictionary. At first glance it seems that the ConcurrentDictionary suits your needs, so you could use it as a basis of your implementation and see how it goes.
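To illustrate how the differential-increment idea and TryRemove fit together, here is a minimal sketch. The key format and the saveIncrement callback are assumptions for the example, not part of the original service.
using System;
using System.Collections.Concurrent;

public class RequestCounter
{
    // Key identifies company + user + hour; value is the increment accumulated since the last flush.
    private readonly ConcurrentDictionary<string, int> pending = new ConcurrentDictionary<string, int>();

    public void Increment(string companyId, string userId)
    {
        var hour = DateTime.UtcNow.ToString("yyyy-MM-dd HH:00");
        var key = $"{companyId}|{userId}|{hour}";
        pending.AddOrUpdate(key, 1, (_, current) => current + 1);
    }

    // Called by the timer every few minutes.
    public void Flush(Action<string, int> saveIncrement)
    {
        foreach (var key in pending.Keys)
        {
            // Atomically take the accumulated value; concurrent Increment calls
            // simply recreate the entry starting from zero again.
            if (pending.TryRemove(key, out var count))
                saveIncrement(key, count); // e.g. a stored procedure doing an insert-or-update
        }
    }
}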
To avoid having to make a database call on every request, I've developed a service class in C# maintaining an in-memory representation of the statistics stored in a database:
If you want to avoid Update race conditions, you should stop doing exactly that.
Databases, by design and by purpose, prevent simple update race conditions. This is a simple counting-up operation: a single DML statement, implicitly protected by transactions, journaling and locks. Indeed, that is why calling them a lot is costly.
You are fighting the concurrency handling that is already there by adding that service. You are also moving a DB job outside of the DB, and moving DB jobs outside of the DB is just going to cause issues.
If your worry is speed:
Please read the Speed Rant.
Maybe a distributed database design is the droid you are looking for? Distributed databases have seen a massive surge in popularity since mobile devices proliferated, for both speed and reliability reasons.
In general, to make your code thread-safe:
Use concurrent collections, such as ConcurrentDictionary
Make sure to understand concepts such as the lock statement, Monitor.Wait and Monitor.PulseAll from tutorials. Locks can be slow if I/O operations (such as disk reads/writes) are performed while the lock is held, but for something in RAM it is not necessary to worry about. If you really have some lengthy operation such as I/O or HTTP requests, consider using a ConcurrentQueue and learn about the producer-consumer pattern to have many workers process items from the queue (a minimal sketch follows below).
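A producer-consumer sketch along those lines, using BlockingCollection (which wraps a ConcurrentQueue by default); the WorkItem type and the Persist placeholder are assumptions for illustration, not part of the original service.
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public class WorkItem
{
    public string CompanyId { get; set; }
    public string UserId { get; set; }
}

public class StatsWriter
{
    // BlockingCollection is backed by a ConcurrentQueue by default.
    private readonly BlockingCollection<WorkItem> queue = new BlockingCollection<WorkItem>();

    public StatsWriter(int workerCount)
    {
        for (int i = 0; i < workerCount; i++)
        {
            Task.Run(() =>
            {
                // Each worker blocks until an item is available, then processes it.
                foreach (var item in queue.GetConsumingEnumerable())
                    Persist(item);
            });
        }
    }

    // Producers (API request handlers) just enqueue; this never touches the database.
    public void Enqueue(WorkItem item) => queue.Add(item);

    public void Shutdown() => queue.CompleteAdding();

    private void Persist(WorkItem item)
    {
        // Placeholder for the slow operation (database write, HTTP call, ...).
        Console.WriteLine($"Persisting {item.CompanyId}/{item.UserId}");
    }
}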
You can also try Redis server to cache database without need to design something from zero.
You can also make your service a singleton and update the database only after a value changes. For reading the value, you already have it stored in your service.

Neo4j locks CSV file even after importing

EDIT: It seems that the files (mostly? only?) stay locked when a transactional error occurs. I have to restart the database for them to be released, even though the database is not actively processing anything (no CPU/RAM/HDD activity).
Environment
I have an ASP.NET application that uses the Neo4jClient NuGet package to talk to a Neo4j database. I have N SimpleNode objects that need to be inserted, where N can be anything from 100 to 50,000. There are other objects for the M edges that need to be inserted, where M can be from 100 to 1,000,000.
Code
Inserting using normal inserts is too slow with 8.000 nodes taking about 80 seconds with the following code:
Client.Cypher
    .Unwind(sublist, "node")
    .Merge("(n:Node { label: node.label })")
    .OnCreate()
    .Set("n = node")
    .ExecuteWithoutResults();
Therefore I used the import CSV function with the following code:
using (var sw = new StreamWriter(File.OpenWrite("temp.csv")))
{
    sw.Write(SimpleNodeModel.Header + "\n");
    foreach (var simpleNodeModel in nodes)
    {
        sw.Write(simpleNodeModel.ToCSVWithoutID() + "\n");
    }
}

var f = new FileInfo("temp.csv");

Client.Cypher
    .LoadCsv(new Uri("file://" + f.FullName), "csvNode", true)
    .Merge("(n:Node {label:csvNode.label, source:csvNode.source})")
    .ExecuteWithoutResults();
While still slow, it is an improvement.
Problem
The problem is that the CSV files are locked by the Neo4j client (not by C# or any of my own code, it seems). I would like to overwrite the temp .CSV files so the disk does not fill up, or delete them after use. This is currently impossible as the process locks them and I cannot touch them. This also means that running this code twice crashes the program, as it cannot write to the file the second time.
The nodes are inserted and do appear normally, so it is not the case that it is still working on them. After some unknown and widely varying amount of time, files do seem to unlock.
Question
How can I stop the Neo4j client from locking the files long after use? Why does it lock them for so long? Another question: is there a better way of doing this in C#? I am aware of the Java importer, but I would like my tool to stay in the ASP.NET environment. It must be possible to insert 8,000 simple nodes within 2 seconds in C#?
SimpleNode class
public class SimpleNodeModel
{
    public long id { get; set; }
    public string label { get; set; }
    public string source { get; set; } = "";

    public override string ToString()
    {
        return $"label: {label}, source: {source}, id: {id}";
    }

    public SimpleNodeModel(string label, string source)
    {
        this.label = label;
        this.source = source;
    }

    public SimpleNodeModel() { }

    public static string Header => "label,source";

    public string ToCSVWithoutID()
    {
        return $"{label},{source}";
    }
}

What is the best way to work with multiple files in multithread in C#?

I am creating a Windows Forms application where I select a folder that contains multiple *.txt files. Their length may vary from a few thousand lines (kB) up to 50 million lines (1 GB). Every line of a file holds three pieces of information: a date as a long, a location id as an int and a value as a float, all separated by semicolons (;). I need to calculate the min and max value across all those files and report which file they are in, and then the most frequent value.
I already have these files verified and stored in an ArrayList. I open a thread to read the files one by one, and I read the data line by line. It works fine, but when there are 1 GB files I run out of memory. I tried to store the values in a dictionary, where the key would be the date and the value would be an object that contains all the info loaded from the line along with the filename. I see I cannot use a dictionary, because at about 6M values I ran out of memory. So I should probably do it multithreaded. I thought I could run two threads: one that reads the file and puts the info in some kind of container, and another that reads from it, makes the calculations and then deletes the values from the container. But I don't know which container could do such a thing. Moreover, I need to calculate the most frequent value, so the values need to be stored somewhere, which leads me back to some kind of dictionary, but I already know I will run out of memory. I don't have much experience with threads either, so I don't know what is possible. Here is my code so far:
GUI:
namespace STI {
    public partial class GUI : Form {
        private String path = null;
        public static ArrayList txtFiles;

        public GUI() {
            InitializeComponent();
            _GUI1 = this;
        }

        //I run it in thread. I thought I would run the second
        //one here that would work with the values inputed in some container
        private void buttonRun_Click(object sender, EventArgs e) {
            ThreadDataProcessing processing = new ThreadDataProcessing();
            Thread t_process = new Thread(processing.runProcessing);
            t_process.Start();

            //ThreadDataCalculating calculating = new ThreadDataCalculating();
            //Thread t_calc = new Thread(calculating.runCalculation());
            //t_calc.Start();
        }
    }
}
ThreadProcessing.cs
namespace STI.thread_package {
    class ThreadDataProcessing {
        public static Dictionary<long, object> finalMap = new Dictionary<long, object>();

        public void runProcessing() {
            foreach (FileInfo file in GUI.txtFiles) {
                using (FileStream fs = File.Open(file.FullName.ToString(), FileMode.Open))
                using (BufferedStream bs = new BufferedStream(fs))
                using (StreamReader sr = new StreamReader(bs)) {
                    String line;
                    String[] splitted;
                    try {
                        while ((line = sr.ReadLine()) != null) {
                            splitted = line.Split(';');
                            if (splitted.Length == 3) {
                                long date = long.Parse(splitted[0]);
                                int location = int.Parse(splitted[1]);
                                float value = float.Parse(splitted[2], CultureInfo.InvariantCulture);
                                Entry entry = new Entry(date, location, value, file.Name);
                                if (!finalMap.ContainsKey(entry.getDate())) {
                                    finalMap.Add(entry.getDate(), entry);
                                }
                            }
                        }
                        GUI._GUI1.update("File \"" + file.Name + "\" completed\n");
                    }
                    catch (FormatException ex) {
                        GUI._GUI1.update("Wrong file format.");
                    }
                    catch (OutOfMemoryException) {
                        GUI._GUI1.update("Out of memory");
                    }
                }
            }
        }
    }
}
and the object in which I put the values from lines:
Entry.cs
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace STI.entities_package {
    class Entry {
        private long date;
        private int location;
        private float value;
        private String fileName;
        private int count;

        public Entry(long date, int location, float value, String fileName) {
            this.date = date;
            this.location = location;
            this.value = value;
            this.fileName = fileName;
            this.count = 1;
        }

        public long getDate() {
            return date;
        }

        public int getLocation() {
            return location;
        }

        public String getFileName() {
            return fileName;
        }
    }
}
I don't think that multithreading is going to help you here - it could help you separate the IO-bound tasks from the CPU-bound tasks, but your CPU-bound tasks are so trivial that I don't think they warrant their own thread. All multithreading is going to do is unnecessarily increase the problem complexity.
Calculating the min/max in constant memory is trivial: just maintain minFile and maxFile variables that get updated when the current file's value is less than the current min or greater than the current max.
Finding the most frequent value is going to require more memory, but with only a few million distinct values you ought to have enough RAM to store a Dictionary<float, int> that maintains the frequency of each value, after which you iterate through the map to determine which value had the highest frequency. If for some reason you don't have enough RAM (make sure that your files are being closed and garbage collected if you're running out of memory, because a Dictionary<float, int> with a few million entries ought to fit in less than a gigabyte of RAM), then you can make multiple passes over the files: on the first pass store the counts in a Dictionary<interval, int> where you've split the range between MIN_FLOAT and MAX_FLOAT into a few thousand sub-intervals, then on the next pass ignore all values that didn't fall into the interval with the highest frequency, thus shrinking the dictionary's size. However, the Dictionary<float, int> ought to fit into memory, so unless you start processing billions of values instead of millions you probably won't need a multi-pass procedure.
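A minimal single-pass sketch of that approach, assuming the same semicolon-separated line format described in the question (the variable names are illustrative):
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

class StreamingStats
{
    static void Main(string[] args)
    {
        float min = float.MaxValue, max = float.MinValue;
        string minFile = null, maxFile = null;
        var frequencies = new Dictionary<float, int>();

        foreach (var path in Directory.EnumerateFiles(args[0], "*.txt"))
        {
            // ReadLines streams the file; only the running totals are kept in memory.
            foreach (var line in File.ReadLines(path))
            {
                var parts = line.Split(';');
                if (parts.Length != 3) continue;

                float value = float.Parse(parts[2], CultureInfo.InvariantCulture);

                if (value < min) { min = value; minFile = path; }
                if (value > max) { max = value; maxFile = path; }

                int count;
                frequencies.TryGetValue(value, out count);
                frequencies[value] = count + 1;
            }
        }

        // Find the most frequent value by scanning the frequency table once.
        float mostFrequent = 0; int bestCount = 0;
        foreach (var pair in frequencies)
        {
            if (pair.Value > bestCount) { bestCount = pair.Value; mostFrequent = pair.Key; }
        }

        Console.WriteLine("Min {0} in {1}, max {2} in {3}, most frequent {4} ({5}x)",
            min, minFile, max, maxFile, mostFrequent, bestCount);
    }
}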

Massive list in c# & wpf

I have a list consisting of all the US Zip codes, each with 3 elements. Thus the list is ~45,000 x 3 strings. What is the best way to load this, essentially the most efficient/optimized? Right now I have a foreach loop running it, and every time it gets to the loading point it hangs. Is there a better approach?
Edit
The usage of this is for the user to be able to type in a zip code and have the city and state displayed in two other text boxes. Right now I have it set to check as the user types, and once the full five-digit code is entered it freezes up, I believe at the ZipCodes codes = new ZipCodes() line.
This is the code I'm currently using. I left one of the zipCode.Add statements in, but deleted the other 44,999.
struct ZipCode
{
    private String cvZipCode;
    private String cvCity;
    private String cvState;

    public string ZipCodeID { get { return cvZipCode; } }
    public string City { get { return cvCity; } }
    public string State { get { return cvState; } }

    public ZipCode(string zipCode, string city, string state)
    {
        cvZipCode = zipCode;
        cvCity = city;
        cvState = state;
    }

    public override string ToString()
    {
        return City.ToString() + ", " + State.ToString();
    }
}

class ZipCodes
{
    private List<ZipCode> zipCodes = new List<ZipCode>();

    public ZipCodes()
    {
        zipCodes.Add(new ZipCode("97475", "SPRINGFIELD", "OR"));
    }

    public IEnumerable<ZipCode> GetLocation()
    {
        return zipCodes;
    }

    public IEnumerable<ZipCode> GetLocationZipCode(string zipCode)
    {
        return zipCodes;
    }

    public IEnumerable<ZipCode> GetLocationCities(string city)
    {
        return zipCodes;
    }

    public IEnumerable<ZipCode> GetLocationStates(string state)
    {
        return zipCodes;
    }
}

private void LocateZipCode(TextBox source, TextBox destination, TextBox destination2 = null)
{
    ZipCodes zips = new ZipCodes();
    string tempZipCode;
    List<ZipCode> zipCodes = new List<ZipCode>();

    try
    {
        if (source.Text.Length == 5)
        {
            tempZipCode = source.Text.Substring(0, 5);
            dataWorker.RunWorkerAsync();
            destination.Text = zipCodes.Find(searchZipCode => searchZipCode.ZipCodeID == tempZipCode).City.ToString();
            if (destination2.Text != null)
            {
                destination2.Text = zipCodes.Find(searchZipCode => searchZipCode.ZipCodeID == tempZipCode).State.ToString();
            }
        }
        else destination2.Text = "";
    }
    catch (NullReferenceException)
    {
        destination.Text = "Invalid Zip Code";
        if (destination2 != null)
        {
            destination2.Text = "";
        }
    }
}
There are several options that depend on your use case and target client machines.
Use paged controls. Use existing paged control variants (e.g. Telerik) which support paging. This way you will deal with a smaller subset of the available data.
Use search/filter controls. Force users to enter partial data to reduce the size of the data you need to show.
Using an ObservableCollection will cause performance problems, as the framework-provided class does not support bulk loading. Make your own observable collection which supports bulk loading (i.e. one that does not raise a collection-changed event on every element you add). On a list of 5,000-10,000 members I've seen loading times reduced from 3 s to 0.03 s.
Use async operations when loading data from the DB. This way the UI stays responsive and you have a chance to inform users about the current operation. This improves the perceived performance immensely.
Instead of loading all of the items, try loading on demand. For instance, when the user enters the first three letters, query the list and return only the matching items. Many controls exist for this purpose, both in Silverlight and in AJAX.
Thanks for all the responses, I really do appreciate them. A couple I didn't really understand, but I know that's my own lack of knowledge in certain areas of C#. In researching them, though, I did stumble across a different solution that has worked beautifully: using a Dictionary instead of a List. Even without using a BackgroundWorker, it loads on app start-up in about 5 seconds. I had heard of Dictionary before, but until now had never had a cause to use or research it, so this was doubly beneficial to me. Thanks again for all the assistance!
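For anyone curious, a minimal sketch of that dictionary approach, keyed by the five-digit code and reusing the ZipCode struct from the question; loading the full ~45,000 entries from a file is an assumption for illustration.
using System.Collections.Generic;

class ZipCodeLookup
{
    // Key: five-digit zip code, value: the ZipCode struct defined in the question.
    private readonly Dictionary<string, ZipCode> zipCodes = new Dictionary<string, ZipCode>();

    public ZipCodeLookup()
    {
        // In practice the ~45,000 entries would be loaded from a file or embedded resource.
        zipCodes["97475"] = new ZipCode("97475", "SPRINGFIELD", "OR");
    }

    public bool TryLocate(string zip, out ZipCode result)
    {
        // O(1) hash lookup instead of List<T>.Find's linear scan over ~45,000 items.
        return zipCodes.TryGetValue(zip, out result);
    }
}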

Selecting million records from SQL Server

We need to index (in ASP.NET) all our records stored in a SQL Server table. That table has around 2M records with text (nvarchar) data too in each row.
Is it okay to fetch all records in one go as we need to index them (for search)? What is the other option (I want to avoid pagination)?
Note: I am not displaying these records, just need all of them in one go so that I can index them via a background thread.
Do I need to set any long time outs for my query? If yes, what is the most effective method for setting longer time outs if I am running the query from an ASP.NET page?
If I needed something like this, just thinking about it from the database side, I'd probably export it to a file. Then that file can get moved around pretty easily. Moving around data sets that large is a huge pain to all involved. You can use SSIS, sqlcmd or even bcp in a batch command to get it done.
Then, you just have to worry about what you're doing with it on the app side, no worries about locking & everything on the database side once you've exported it.
I don't think a page is a good place for this regardless. There should be a different process or program that does this. On a related note maybe something like http://incubator.apache.org/lucene.net/ would help you?
Is it okay to fetch all records in one go as we need to index them (for search)? What is the other option (I want to avoid pagination)?
Memory Management Issue / Performance Issue
You can face a System.OutOfMemoryException if you bring back 2 million records in one go, as you will be keeping all those records in a DataSet and that DataSet will live in RAM.
Do I need to set any long time outs for my query? If yes, what is the most effective method for setting longer time outs if I am running the query from an ASP.NET page?
using (System.Data.SqlClient.SqlCommand cmd = new System.Data.SqlClient.SqlCommand())
{
    cmd.CommandTimeout = 0; // 0 means no timeout
}
Suggestion
It's better to filter the records out at the database level...
Fetch all records from the database and save them to a file. Access that file for any intermediate operations.
What you describe is Extract, Transform, Load (ETL). There are 2 options I'm aware of:
SSIS which is part of sql server
Rhino.ETL
I prefer Rhino.ETL as it's completely written in C#, you can create scripts in Boo, and it's much easier to test and compose ETL processes. The library is also built to handle large sets of data, so memory management is built in.
One final note: while ASP.NET might be the entry point to start the indexing process, I wouldn't run the process within ASP.NET as it could take minutes or hours depending on the number of records and the processing involved.
Instead, have ASP.NET be the entry point that fires off a background task to process the records. Ideally, make it completely independent of ASP.NET so you avoid any timeout or shutdown issues.
Process your records in batches. You are going to have two main issues: (1) you need to index all of the existing records, and (2) you will want to update the index with records that were added, updated or deleted. It might sound easier just to drop the index and recreate it, but that should be avoided if possible. Below is an example of processing the [Production].[TransactionHistory] table from the AdventureWorks2008R2 database in batches of 10,000 records. It does not load all of the records into memory. Output on my local computer produces Processed 113443 records in 00:00:00.2282294. Obviously, this doesn't take into consideration a remote computer or the processing time for each record.
using System;
using System.Collections.Generic;
using System.Configuration;
using System.Data;
using System.Data.SqlClient;
using System.Diagnostics;

class Program
{
    private static string ConnectionString
    {
        get { return ConfigurationManager.ConnectionStrings["db"].ConnectionString; }
    }

    static void Main(string[] args)
    {
        int recordCount = 0;
        int lastId = -1;
        bool done = false;
        Stopwatch timer = Stopwatch.StartNew();

        do
        {
            done = true;
            IEnumerable<TransactionHistory> transactionDataRecords = GetTransactions(lastId, 10000);
            foreach (TransactionHistory transactionHistory in transactionDataRecords)
            {
                lastId = transactionHistory.TransactionId;
                done = false;
                recordCount++;
            }
        } while (!done);

        timer.Stop();
        Console.WriteLine("Processed {0} records in {1}", recordCount, timer.Elapsed);
    }

    // Get a new open connection
    private static SqlConnection GetOpenConnection()
    {
        SqlConnection connection = new SqlConnection(ConnectionString);
        connection.Open();
        return connection;
    }

    private static IEnumerable<TransactionHistory> GetTransactions(int lastTransactionId, int count)
    {
        const string sql = "SELECT TOP(@count) [TransactionID],[TransactionDate],[TransactionType] FROM [Production].[TransactionHistory] WHERE [TransactionID] > @LastTransactionId ORDER BY [TransactionID]";
        return GetData<TransactionHistory>((connection) =>
        {
            SqlCommand command = new SqlCommand(sql, connection);
            command.Parameters.AddWithValue("@count", count);
            command.Parameters.AddWithValue("@LastTransactionId", lastTransactionId);
            return command;
        }, DataRecordToTransactionHistory);
    }

    // Function to convert a data record to the TransactionHistory object
    private static TransactionHistory DataRecordToTransactionHistory(IDataRecord record)
    {
        TransactionHistory transactionHistory = new TransactionHistory();
        transactionHistory.TransactionId = record.GetInt32(0);
        transactionHistory.TransactionDate = record.GetDateTime(1);
        transactionHistory.TransactionType = record.GetString(2);
        return transactionHistory;
    }

    private static IEnumerable<T> GetData<T>(Func<SqlConnection, SqlCommand> commandBuilder, Func<IDataRecord, T> dataFunc)
    {
        using (SqlConnection connection = GetOpenConnection())
        {
            using (SqlCommand command = commandBuilder(connection))
            {
                using (IDataReader reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        T record = dataFunc(reader);
                        yield return record;
                    }
                }
            }
        }
    }
}

public class TransactionHistory
{
    public int TransactionId { get; set; }
    public DateTime TransactionDate { get; set; }
    public string TransactionType { get; set; }
}