A third-party application creates one database per project. All the databases have the same tables and structure. New projects may be added at any time, so I can't use any fixed EF schema.
What I do now is:
private IEnumerable<Respondent> getListRespondentWithStatuts(string db)
{
    return query("select * from " + db + ".dbo.respondent");
}

private List<Respondent> query(string sqlQuery)
{
    using (var sqlConx = new SqlConnection(Settings.Default.ConnectionString))
    {
        sqlConx.Open();
        var cmd = new SqlCommand(sqlQuery, sqlConx);
        return transformReaderIntoRespondentList(cmd.ExecuteReader());
    }
}

private List<Respondent> transformReaderIntoRespondentList(SqlDataReader sqlDataReader)
{
    var listeDesRépondants = new List<Respondent>();
    while (sqlDataReader.Read())
    {
        var respondent = new Respondent
        {
            CodeRépondant = (string)sqlDataReader["ResRespondent"],
            IsActive = (bool?)sqlDataReader["ResActive"],
            CodeRésultat = (string)sqlDataReader["ResCodeResult"],
            Téléphone = (string)sqlDataReader["Resphone"],
            IsUnContactFinal = (bool?)sqlDataReader["ResCompleted"]
        };
        listeDesRépondants.Add(respondent);
    }
    return listeDesRépondants;
}
This works fine, but it is deadly slow (20,000 records per minute). Do you have any hints on what strategy would be faster? For info, what is really slow is the transformReaderIntoRespondentList method.
Thanks!!
Generally speaking, anything SELECT * FROM is bad practice, and here it could also mean you are pulling back more data than is actually required. The transform only uses a few columns; are more columns than required being returned? Consider replacing it with:
private IEnumerable<Respondent> getListRespondentWithStatuts(string db)
{
    return query("select ResRespondent, ResActive, ResCodeResult, Resphone, ResCompleted from " + db + ".dbo.respondent");
}
Also, guard against SQL injection attacks; concatenating strings into SQL queries is very dangerous.
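Since the database name cannot be passed as a SqlParameter, one option is to validate it against a known list before it ever reaches the SQL text. A minimal sketch; knownProjectDatabases is hypothetical and should be populated from a trusted source (for example sys.databases), never from user input:

// The db name is whitelisted before being concatenated into the query.
private static readonly HashSet<string> knownProjectDatabases =
    new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Project1", "Project2" };

private IEnumerable<Respondent> getListRespondentWithStatuts(string db)
{
    if (!knownProjectDatabases.Contains(db))
        throw new ArgumentException("Unknown project database: " + db);

    return query("select ResRespondent, ResActive, ResCodeResult, Resphone, ResCompleted from [" + db + "].dbo.respondent");
}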
When pulling data from a DataReader, I find that ordinal (non-named) lookups work best:
var respondent = new Respondent
{
    CodeRépondant = sqlDataReader.GetString(0),
    IsActive = sqlDataReader.IsDBNull(1) ? (Boolean?)null : sqlDataReader.GetBoolean(1),
    CodeRésultat = sqlDataReader.GetString(2),
    Téléphone = sqlDataReader.GetString(3),
    IsUnContactFinal = sqlDataReader.IsDBNull(4) ? (Boolean?)null : sqlDataReader.GetBoolean(4)
};
I have not explicitly tested the performance difference in a long while, but it used to make a notable difference: the ordinal lookups avoid the name-to-ordinal resolution and also avoid boxing/unboxing the values.
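If you prefer not to hard-code column positions, a middle ground is to resolve each ordinal once with GetOrdinal before the loop and then use the integer lookups inside it. A sketch, reusing the column names from the question:

private List<Respondent> transformReaderIntoRespondentList(SqlDataReader reader)
{
    var list = new List<Respondent>();

    // resolve each column position once, outside the loop
    int iCode = reader.GetOrdinal("ResRespondent");
    int iActive = reader.GetOrdinal("ResActive");
    int iResult = reader.GetOrdinal("ResCodeResult");
    int iPhone = reader.GetOrdinal("Resphone");
    int iCompleted = reader.GetOrdinal("ResCompleted");

    while (reader.Read())
    {
        list.Add(new Respondent
        {
            CodeRépondant = reader.GetString(iCode),
            IsActive = reader.IsDBNull(iActive) ? (bool?)null : reader.GetBoolean(iActive),
            CodeRésultat = reader.GetString(iResult),
            Téléphone = reader.GetString(iPhone),
            IsUnContactFinal = reader.IsDBNull(iCompleted) ? (bool?)null : reader.GetBoolean(iCompleted)
        });
    }
    return list;
}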
Other than that, without more info it is hard to say... do you need all 20,000 records?
UPDATE
Ran a simple local test case with 300,000 records and reduced the time to load all the data by almost 50%. I imagine the results will vary depending on the type of data being retrieved, but it still makes a difference in overall execution time. That said, in my environment we are talking about a drop from 650 ms to just over 300 ms.
NOTE
If respondent is a view, what is likely "really slow" is the database building up the result set; although the data reader will start processing information as soon as records are available, the ultimate bottleneck will be the database itself and/or network latency. Other than the above optimizations, there is not much more you can do in your code unless you can index the view/table to optimize the query and/or reduce the amount of information required.
Related
I'm not sure External Source is the correct phrasing, but essentially I have a view in my database that points to a table in a different database. Not always, but from time to time, I get an ORA-12537 Network Session: End of File exception. I'm using Entity Framework, so I tried breaking it up so that instead of one massive query, it runs a handful of queries to generate the final result. But this has had mixed to no impact.
public List<SomeDataModel> GetDataFromList(List<string> SOME_LIST_OF_STRINGS)
{
    var retData = new List<SomeDataModel>();
    const int MAX_CHUNK_SIZE = 1000;
    var totalPages = (int)Math.Ceiling((decimal)SOME_LIST_OF_STRINGS.Count / MAX_CHUNK_SIZE);
    var pageList = new List<List<string>>();

    for (var i = 0; i < totalPages; i++)
    {
        var chunkItems = SOME_LIST_OF_STRINGS.Skip(i * MAX_CHUNK_SIZE).Take(MAX_CHUNK_SIZE).ToList();
        pageList.Add(chunkItems);
    }

    using (var context = new SOMEContext())
    {
        foreach (var pageChunk in pageList)
        {
            var result = (from r in context.SomeEntity
                          where SOME_LIST_OF_STRINGS.Contains(r.SomeString)
                          select r).ToList();
            result.ForEach(x => retData.Add(mapper.Map<SomeDataModel>(x)));
        }
    }

    return retData;
}
I'm not sure if there's a different approach to dealing with this exception, or if breaking up the query has the desired effect. It's probably worth noting that SOME_LIST_OF_STRINGS is pretty large (about 21,000 entries on average), so totalPages usually sits around 22.
Sometimes, that error can be caused by an excessively large "IN" list in the SQL. For example:
SELECT *
FROM tbl
WHERE somecol IN ( ...huge list of stuff... );
Enabling application or database level tracing could help reveal whether the SQL that's being constructed behind the scenes has a large IN list.
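For the application side, a quick way to see the generated SQL is the logging hook on the context, assuming this project is on EF6 (where DbContext exposes Database.Log):

using (var context = new SOMEContext())
{
    // log every command EF sends so the size of the IN (...) list can be inspected
    context.Database.Log = s => System.Diagnostics.Debug.WriteLine(s);

    // ... run the failing query here and check the Output window
}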
A workaround might be to INSERT "...huge list of stuff..." into a table and then use something similar to the query below in order to avoid the huge list of literals.
SELECT *
FROM tbl
WHERE somecol IN ( select stuff from sometable );
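Separately, note that the chunking code in the question filters on SOME_LIST_OF_STRINGS rather than on pageChunk, so each of the ~22 queries still ships the full ~21,000-item list. If the goal is to cap the IN list at 1,000 items, the loop body presumably needs to look more like this (a sketch against the same hypothetical names):

foreach (var pageChunk in pageList)
{
    // filter on the 1,000-item chunk, not on the full list,
    // so the generated IN (...) clause stays small
    var result = (from r in context.SomeEntity
                  where pageChunk.Contains(r.SomeString)
                  select r).ToList();

    result.ForEach(x => retData.Add(mapper.Map<SomeDataModel>(x)));
}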
Reference*:
https://support.oracle.com/knowledge/More%20Applications%20and%20Technologies/2226769_1.html
*I mostly drew my conclusions from the part of this reference that's not publicly viewable.
I'm using Entity Framework to build a database. There are two models, Workers and Skills; each Worker has zero or more Skills. I initially read this data into memory from a CSV file somewhere, and store it in a dictionary called allWorkers. Next, I write the data to the database like this:
// Populate database
using (var db = new SolverDbContext())
{
    // Add all distinct skills to database
    db.Skills.AddRange(allSkills
        .Distinct(StringComparer.InvariantCultureIgnoreCase)
        .Select(s => new Skill
        {
            Reference = s
        }));

    db.SaveChanges(); // Very quick

    var dbSkills = db.Skills.ToDictionary(k => k.Reference, v => v);

    // Add all workers to database
    var workforce = allWorkers.Values
        .Select(i => new Worker
        {
            Reference = i.EMPLOYEE_REF,
            Skills = i.GetSkills().Select(s => dbSkills[s]).ToArray(),
            DefaultRegion = "wa",
            DefaultEfficiency = i.TECH_EFFICIENCY
        });

    db.Workers.AddRange(workforce);
    db.SaveChanges(); // This call takes 00:05:00.0482197
}
The last db.SaveChanges() takes over five minutes to execute, which I feel is far too long. I ran SQL Server Profiler while the call was executing, and basically what I found was thousands of calls to:
INSERT [dbo].[SkillWorkers]([Skill_SkillId], [Worker_WorkerId])
VALUES (@0, @1)
There are 16,027 rows being added to SkillWorkers, which is a fair amount of data but not huge by any means. Is there any way to optimize this code so it doesn't take 5 minutes to run?
Update: I've looked at other possible duplicates, such as this one, but I don't think they apply. First, I'm not bulk-adding anything in a loop; I'm making a single call to db.SaveChanges() after every row has been added to db.Workers, which should be the fastest way to bulk insert. Second, I've set db.Configuration.AutoDetectChangesEnabled to false. The SaveChanges() call now takes 00:05:11.2273888 (in other words, about the same). I don't think this really matters, since every row is new and there are no changes to detect.
I think what I'm looking for is a way to issue a single statement (a single round trip) that inserts all 16,000 skill rows.
One easy method is to use the EntityFramework.BulkInsert extension.
You can then do:
// Add all workers to database
var workforce = allWorkers.Values
    .Select(i => new Worker
    {
        Reference = i.EMPLOYEE_REF,
        Skills = i.GetSkills().Select(s => dbSkills[s]).ToArray(),
        DefaultRegion = "wa",
        DefaultEfficiency = i.TECH_EFFICIENCY
    });

db.BulkInsert(workforce);
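If pulling in a third-party package is not an option, a similar effect can be had by writing the join rows with SqlBulkCopy directly. A rough sketch only, assuming the SkillWorkers table has exactly the two key columns shown in the profiler output, that the keys are ints, and that you can produce the (SkillId, WorkerId) pairs once the workers have been saved; skillWorkerPairs and connectionString are hypothetical:

// using System.Data; using System.Data.SqlClient;
var table = new DataTable();
table.Columns.Add("Skill_SkillId", typeof(int));
table.Columns.Add("Worker_WorkerId", typeof(int));

foreach (var pair in skillWorkerPairs)          // hypothetical (SkillId, WorkerId) pairs
    table.Rows.Add(pair.SkillId, pair.WorkerId);

using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "dbo.SkillWorkers";
    bulk.WriteToServer(table);                   // one bulk operation instead of 16,000 INSERTs
}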
I'm trying to pull a large-ish dataset (1.4 million records) from a SQL Server and dump it to a file in a WinForms application. I've attempted to do it with paging so that I'm not holding too much in memory at once, but the process continues to grow its memory footprint as it runs. About 25% of the way through, it was taking up about 600,000 KB. Am I doing the paging wrong? Can I get some suggestions on how to keep the memory usage from growing so much?
var query = (from organizations in ctxObj.Organizations
             where organizations.org_type_cd == 1
             orderby organizations.org_ID
             select organizations);

int recordCount = query.Count();
int skipTo = 0;
int take = 1000;

if (recordCount > 0)
{
    while (skipTo < recordCount)
    {
        if (skipTo + take > recordCount)
            take = recordCount - skipTo;

        foreach (Organization o in query.Skip(skipTo).Take(take))
        {
            writeRecord(o);
        }

        skipTo += take;
    }
}
The object context will keep objects in memory until it is disposed. I would recommend disposing the context after each batch to prevent the memory footprint from continuing to grow.
You can also use AsNoTracking() (http://msdn.microsoft.com/en-us/library/gg679352(v=vs.103).aspx) since you are not saving back to the database.
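A rough sketch of the context-per-batch shape, keeping the paging from the question; MyEntities stands in for whatever your context type actually is, and AsNoTracking() is assumed to be available (DbContext-style API):

int skipTo = 0;
const int take = 1000;
bool more = true;

while (more)
{
    // a fresh context per page; disposing it releases everything it materialized
    using (var ctx = new MyEntities())
    {
        var page = ctx.Organizations.AsNoTracking()
                      .Where(o => o.org_type_cd == 1)
                      .OrderBy(o => o.org_ID)
                      .Skip(skipTo)
                      .Take(take)
                      .ToList();

        foreach (var o in page)
            writeRecord(o);

        more = page.Count == take;   // stop once a short page comes back
        skipTo += take;
    }
}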
Get rid of paging and use AsNoTracking.
Test Code
static void Main(string[] args)
{
    var sw = new Stopwatch();
    sw.Start();

    using (var context = new MyEntities())
    {
        var query = (from organizations in context.LargeSampleTable.AsNoTracking()
                     where organizations.ErrorID != null
                     orderby organizations.ErrorID
                     select organizations); // large sample table, 146,994 rows

        foreach (ApplicationErrorLog o in query)
        {
            writeRecord(o);
        }
    }

    sw.Stop();
    Console.WriteLine("Completed after: {0}", sw.Elapsed);
    Console.ReadLine();
}

private static void writeRecord(ApplicationErrorLog o)
{
    ;
}
Test Case Result:
Memory Consumption reduced: 96%
Execution Time reduced: 50%
Interpretation
AsNoTracking provides benefits to memory usage for obvious reasons: we don't have to maintain references to the entities as we load them, so objects become eligible for GC almost immediately. Combine lazy evaluation with AsNoTracking and there is no need for paging, and context disposal can be deferred.
While this is a single test, the large number of rows and the exclusion of most external factors make it a reasonable representation of the general case.
A few things.
Calling Count() runs your query. You then run it a second time to get the results. You don't need to do this.
The memory you're seeing is due to loading entities into memory. If you only need a subset of fields, project to an anonymous type (or a simpler named type). This avoids change tracking and its overhead.
Used in this way, EF can be a nice, strongly typed API over lightweight SQL queries.
Something like this should do the trick:
var query = from o in ctxObj.Organizations
            where o.org_type_cd == 1
            orderby o.org_ID
            select new { o.org_ID, o.Name };

foreach (var org in query)
{
    write(org.org_ID, org.Name);
}
Why don't you just use a standard System.Data.SqlClient.SqlConnection class? You can read the results of a command row by row using the SqlDataReader class and write each row to a file. You have full control, so you can guarantee that your code only references one row at a time.
using (var writer = new System.IO.StreamWriter(fileName))
using (var conn = new SqlConnection(connectionString))
{
    conn.Open();
    using (var cmd = conn.CreateCommand())
    {
        cmd.CommandText = "SELECT * FROM Organizations WHERE org_type_cd = 1 ORDER BY org_ID";
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                int id = (int)reader["org_ID"];
                int org_type_cd = (int)reader["org_type_cd"];
                writer.WriteLine(...);
            }
        }
    }
}
Entity Framework isn't meant to solve every problem or to be your exclusive data access framework. It's meant to make simple CRUD operations easier to write. Dealing with millions of rows is a good use case for a more specialized solution.
I am in the process of improving a console app, and at the moment I can't get it to update rows; instead, it just creates a new row with the newer information in it.
class Program
{
    List<DriveInfo> driveList = DriveInfo.GetDrives().Where(x => x.IsReady).ToList<DriveInfo>(); // Get all the drive info
    Server server = new Server(); // Create the server object
    ServerDrive serverDrives = new ServerDrive();

    public static void Main()
    {
        Program c = new Program();
        c.RealDriveInfo();
        c.WriteInToDB();
    }

    public void RealDriveInfo()
    {
        // Insert information of one server
        server.ServerID = 0; // (PK) ID auto-assigned by SQL
        server.ServerName = string.Concat(System.Environment.MachineName);

        // Inserts ServerDrives information.
        for (int i = 0; i < driveList.Count; i++)
        {
            // All information used in dbo.ServerDrives
            serverDrives.DriveLetter = driveList[i].Name;
            serverDrives.TotalSpace = driveList[i].TotalSize;
            serverDrives.DriveLabel = driveList[i].VolumeLabel;
            serverDrives.FreeSpace = driveList[i].TotalFreeSpace;
            serverDrives.DriveType = driveList[i].DriveFormat;

            server.ServerDrives.Add(serverDrives);
        }
    }

    public void WriteInToDB()
    {
        // Add the information to a SQL database using Linq.
        DataClasses1DataContext db = new DataClasses1DataContext(@"sqlserver");
        db.Servers.InsertOnSubmit(server);
        db.SubmitChanges();
    }
}
What I would like is for the information gathered by the RealDriveInfo() method to update the currently stored rows, inserting a new row only when one is actually needed, instead of simply inserting new entries every time it has newer information.
At the moment it runs the method, gathers the relevant data, then enters it as a new row in both tables.
Any help would be appreciated :)
It's creating a new db entry each time because you are making a new server object each time, then calling InsertOnSubmit() - which inserts (creates) a new record.
I'm not entirely sure what you are trying to do, but a db update would involve selecting an existing record, modifying it, then attaching it back to the data context and calling SubmitChanges().
This article on Updating Entities (Linq to SQL) might help.
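In terms of the code in the question, the select-modify-submit shape would look roughly like this. A sketch only, assuming ServerName is enough to identify the existing server row and DriveLetter enough to identify the drive:

using (var db = new DataClasses1DataContext(@"sqlserver"))
{
    // look up the server by name; null means it has never been saved
    var existing = db.Servers.SingleOrDefault(s => s.ServerName == server.ServerName);

    if (existing == null)
    {
        db.Servers.InsertOnSubmit(server);   // first run: insert everything
    }
    else
    {
        foreach (var drive in server.ServerDrives)
        {
            var existingDrive = existing.ServerDrives
                .FirstOrDefault(d => d.DriveLetter == drive.DriveLetter);

            if (existingDrive == null)
            {
                existing.ServerDrives.Add(drive);      // new drive on a known server
            }
            else
            {
                // update the tracked row in place; SubmitChanges issues UPDATEs
                existingDrive.TotalSpace = drive.TotalSpace;
                existingDrive.FreeSpace = drive.FreeSpace;
                existingDrive.DriveLabel = drive.DriveLabel;
                existingDrive.DriveType = drive.DriveType;
            }
        }
    }

    db.SubmitChanges();
}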
The problem is that you are trying to achieve update functionality with a tool that is designed to provide object-oriented querying. LINQ allows for updating existing records, but you have to use it the proper way to achieve this.
The proper way is to fetch the data you want to update from the DB, perform the modifications, and then flush it back to the DB. So, assuming there is a table named Servers in your data context, here's an abstract example:
DataClasses1DataContext db = new DataClasses1DataContext(@"sqlserver");

// extract all servers with ID > 1000 using a lambda expression
var servers = db.Servers.Where(srv => srv.ID > 1000);

foreach (var server in servers)
{
    server.Memory *= 2; // let's feed them up with memory
}

db.SubmitChanges();
Another way to achieve this is to create an entity and then attach it to the DataContext using the Table.Attach method, but that's quite a slippery slope, so I wouldn't recommend taking it until you have sharpened your LINQ skills.
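A sketch of the Attach route, assuming the DataContext exposes a ServerDrives table, and that the mapped table has a timestamp/rowversion column or relaxed UpdateCheck settings (LINQ to SQL needs one of those to attach an entity as modified):

using (var db = new DataClasses1DataContext(@"sqlserver"))
{
    var detached = new ServerDrive
    {
        // set the primary key plus the new values here
    };

    db.ServerDrives.Attach(detached, true); // true = treat the entity as modified
    db.SubmitChanges();
}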
For a detailed description, see
SubmitChanges
Lambda Expressions
I understand what is being asked, and I do not have an easy answer.
For example: you have a form of values; several of the values are changed, maybe some calculated. Or the form can contain a new record.
You create a record of the values
var myrecord = new MyRecord();
Then fill in myrecord, doing whatever validation/calculations you want before you even touch the database itself.
// GetID either returns an existing ID, or zero if this is a new record.
myrecord.id = GetIDForRecordOrZeroIfANewRecord(uniqueName);
myrecord.value1 = txtValue1.Text;
myrecord.value2 = (DateTime)dtDate.Value;
and so on through the fields.
You now have a record; if id is zero you can add it as a new record. But if id refers to an existing record, you seem to have no choice with Linq except to write each value across from myrecord by hand, so you end up with a function that contains something like this:
var thisRecord = (from n in mydatacontext.MyTable
                  where n.id == myrecord.id
                  select n).Single();

thisRecord.value1 = myrecord.value1;
thisRecord.value2 = myrecord.value2;
and so on through all fields.
I do it, but it seems long-winded when I already have all of the information ready in myrecord. A simple function like
mydatacontext.MyTable.Update(myrecord);
would be ideal. It is similar, in fact, to what I do with stored SQL functions in other databases; it simplifies the transfer of a record that is an update rather than a new one.
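One way to get close to that one-liner is a small reflection helper that copies matching simple properties from myrecord onto the entity fetched from the DataContext, before calling SubmitChanges(). A sketch only; it assumes property names and types match between the two objects and deliberately skips anything that isn't a value type or string (so navigation/EntitySet properties are left alone):

using System.Linq;
using System.Reflection;

static class RecordCopier
{
    public static void CopyProperties(object source, object target)
    {
        var targetProps = target.GetType().GetProperties()
            .Where(p => p.CanWrite)
            .ToDictionary(p => p.Name);

        foreach (var sourceProp in source.GetType().GetProperties().Where(p => p.CanRead))
        {
            PropertyInfo targetProp;
            if (targetProps.TryGetValue(sourceProp.Name, out targetProp) &&
                targetProp.PropertyType == sourceProp.PropertyType &&
                (targetProp.PropertyType.IsValueType || targetProp.PropertyType == typeof(string)))
            {
                // copy simple values only; association properties are ignored
                targetProp.SetValue(target, sourceProp.GetValue(source, null), null);
            }
        }
    }
}

Usage would be roughly: fetch thisRecord as above, call RecordCopier.CopyProperties(myrecord, thisRecord), then SubmitChanges().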
I find that my C# apps do a lot of queries with a lot of boilerplate that clutters up my code space. I also want to avoid repetition, but I'm not sure how I could write a method to do this generically.
I am accessing an Oracle database using ODP. I can't use Linq because our data warehouse people refuse to designate primary keys, and ODP support for Linq appears to be, well ... they'd rather have you use their platform.
I can't really return a List because every query returns a different number of columns with different types.
string gufcode = String.Empty;
double cost = 0.0;

OracleCommand GUFCommand2 = thisConnection.CreateCommand();
String GUFQuery2 = "SELECT GUF_ID, COST_RATE FROM SIMPLE_TABLE";
GUFCommand2.CommandText = GUFQuery2;

OracleDataReader GUFReader2 = GUFCommand2.ExecuteReader();
while (GUFReader2.Read())
{
    if (GUFReader2[0 /* GUF_ID */] != DBNull.Value)
    {
        gufcode = Convert.ToString(GUFReader2[0]);
    }
    if (GUFReader2[1 /* COST_RATE */] != DBNull.Value)
    {
        cost = Convert.ToDouble(GUFReader2[1]);
    }
    effortRatioDictionary.Add(gufcode, cost);
}
GUFReader2.Close();
But there are really a lot more terms and a lot more queries like this: I'd say 15 or so queries, some with as many as 15 or so fields returned.
Copy/pasting this boilerplate everywhere leads to a lot of fires: for example, if I don't update everything in the copy-paste, I'll close the wrong reader or (worse) send a different query string to the database.
I'd like to be able to do something like this:
string gufQuery = "SELECT GUF_ID, COST_RATE FROM SIMPLE_TABLE";
List<something> gufResponse = miracleProcedure(gufQuery, thisConnection);
And so most of the boilerplate goes away.
I'm looking for something simple.
I think the main reason you are not able to abstract away a function is that the returned data is going to be different every time, which means the number of reads is going to be different every time as well.
You could just return GUFReader2, but then you would lose the ability to close it inside the function, which you want to keep.
I would say return an array (or list) of objects. Inside the procedure, just read through every row, build up the list, and close the reader/connection before returning. Your calling function will always know what data to expect and in which sequence; it will have to cast the object data, but you are doing that inside this procedure anyway.
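A minimal sketch of that shape, returning one object[] per row and letting using close the command and reader (the caller still owns the connection, as in the question):

// using Oracle.DataAccess.Client; (or Oracle.ManagedDataAccess.Client)
private static List<object[]> RunQuery(string sql, OracleConnection connection)
{
    var rows = new List<object[]>();

    using (var command = connection.CreateCommand())
    {
        command.CommandText = sql;
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                var row = new object[reader.FieldCount];
                reader.GetValues(row);   // copies the current row, DBNulls included
                rows.Add(row);
            }
        }
    }

    return rows;
}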
Some hints:
Implementing IDisposable allows for cleaner code with the using statement.
IMO your magic method should look more like this:
List<T> list = doMagic("SIMPLE_TABLE", columns);
columns could be an array of small structs like this one:
struct Column
{
    public string Name;
    public Type DataType;
}
You might be able to use enums if you use the same tables/columns very often.
Or
You can take some inspiration from the VertexDeclaration, VertexElement, VertexElementFormat and VertexElementUsage types in XNA: http://msdn.microsoft.com/en-us/library/bb197344.
This has proven to be very helpful when dealing with a varying number of 'inputs' in a random order.
In my case I've been able to build an easy-to-use, XNA-like framework for OpenGL with this approach.
Regarding the return type of your list, refer to my 2nd suggestion.
Linq was the right answer. I give credit to David M, above, but I can't mark it as the correct answer since he only left a comment.
I was able to do a semi-generalized method using ArrayLists:
public static ArrayList GeneralQuery(string thisQuery, OracleConnection myConnection)
{
    ArrayList outerAL = new ArrayList();

    OracleCommand GeneralCommand = myConnection.CreateCommand();
    GeneralCommand.CommandText = thisQuery;

    OracleDataReader GeneralReader = GeneralCommand.ExecuteReader();
    while (GeneralReader.Read())
    {
        // one inner ArrayList per row, one entry per column
        ArrayList innerAL = new ArrayList();
        for (int i = 0; i < GeneralReader.FieldCount; i++)
        {
            if (GeneralReader[i] != DBNull.Value)
            {
                innerAL.Add(GeneralReader[i]);
            }
            else
            {
                innerAL.Add(0);
            }
        }
        outerAL.Add(innerAL);
    }
    GeneralReader.Close();

    return outerAL;
}
And the code that calls this method looks like this:
thisConnection.Open();

List<ProjectWrapper> liProjectCOs = new List<ProjectWrapper>();

String ProjectQuery = "SELECT SF_CLIENT_PROJECT.ID, SF_CLIENT_PROJECT.NAMEX, SF_CHANGE_ORDER.ID, SF_CHANGE_ORDER.END_DATE, ";
ProjectQuery += "SF_CLIENT_PROJECT.CONTRACTED_START_DATE, SF_CHANGE_ORDER.STATUS, SF_CHANGE_ORDER.TYPE, SF_CLIENT_PROJECT.ESTIMATED_END_DATE, SF_CLIENT_PROJECT.CONTRACTED_END_DATE ";
ProjectQuery += "FROM SF_CLIENT_PROJECT, SF_CHANGE_ORDER ";
ProjectQuery += "WHERE SF_CHANGE_ORDER.TYPE = 'New' ";
ProjectQuery += "AND SF_CLIENT_PROJECT.ID = SF_CHANGE_ORDER.PROJECT";

ArrayList alProjects = GeneralQuery(ProjectQuery, thisConnection);

foreach (ArrayList proj in alProjects)
{
    ProjectWrapper pw = new ProjectWrapper();
    pw.projectId = Convert.ToString(proj[0]);
    pw.projectName = Convert.ToString(proj[1]);
    pw.changeOrderId = Convert.ToString(proj[2]);
    pw.newEndDate = Convert.ToDateTime(proj[3]);
    pw.startDate = Convert.ToDateTime(proj[4]);
    pw.status = Convert.ToString(proj[5]);
    pw.type = Convert.ToString(proj[6]);

    if (Convert.ToString(proj[7]) != "0") // 0 returned by GeneralQuery if null
        pw.oldEndDate = Convert.ToDateTime(proj[7]);
    else
        pw.oldEndDate = Convert.ToDateTime(proj[8]);

    liProjectCOs.Add(pw);
}
There are a lot of obvious disadvantages here (although it is a lot better than what I was trying to do earlier). It is so much worse than Linq that I renegotiated with our data warehouse people. There's a new guy there, and he was a lot more helpful.
Linq reduces the lines of code above by a factor of 2, and by a factor of 4 compared to the non-encapsulated way I was doing it before that.