I'm trying to import a file into a database and learn a more efficient way of doing things along the way. This article suggested that chaining enumerations yields low memory usage and good performance.
This is my first time chaining several enumerations, and I'm not quite sure how to handle a reset appropriately...
Short story:
I have an enumerator method that reads from a text file and maps each line to a DTO (see the Map function), a Where filter, followed by an Import that takes the resulting enumerable... It all works perfectly, except when the filter returns 0 records. In that case, SQL Server raises System.ArgumentException: There are no records in the SqlDataRecord enumeration.
So I put an if (!list.Any()) return; at the top of my Import method, and it no longer errors. Except it now always skips all the rows up to (and including) the first Valid row in the text file...
Do I need to somehow Reset() the enumerator after my Any() call? Why isn't this necessary when the same construct is used in LINQ and other IEnumerable implementations?
Code:
public IEnumerable<DTO> Map(TextReader sr)
{
    string line;
    while (null != (line = sr.ReadLine()))
    {
        var dto = new DTO();
        var row = line.Split('\t');
        // ... mapping logic goes here ...
        yield return dto;
    }
}
//Filter, called elsewhere
list.Where(x => Valid(x)).Select(x => LoadSqlRecord(x))
public void Import(IEnumerable<SqlDataRecord> list)
{
    using (var conn = new SqlConnection(...))
    {
        if (conn.State != ConnectionState.Open)
            conn.Open();

        var cmd = conn.CreateCommand();
        cmd.CommandText = "Import_Data";
        cmd.CommandType = System.Data.CommandType.StoredProcedure;

        var parm = new SqlParameter();
        cmd.Parameters.Add(parm);
        parm.ParameterName = "Data";
        parm.TypeName = "udt_DTO";
        parm.SqlDbType = SqlDbType.Structured;
        parm.Value = list;

        cmd.ExecuteNonQuery();
        conn.Close();
    }
}
Sorry for the long example. Thanks for reading...
The issue you are seeing is likely not caused by the IEnumerable/IEnumerator interfaces themselves, but by the underlying resources you are using to produce values.
For most enumerables, adding a list.Any() would not cause future enumerations of list to skip items, because each call to list.GetEnumerator returns an independent object. You may have other reasons not to want to create multiple enumerators, such as multiple calls to the database via LINQ to SQL, but each enumerator will still get all the items.
In this case, however, making multiple enumerators does not work because of those underlying resources. Based on the code you posted, I assume the parameter passed to Import is based on a call to the Map method you posted. Each time you enumerate the enumerable returned from Map, you will "start" at the top of the method, but the TextReader and its current position are shared between all enumerators. Even if you did try to call Reset on the IEnumerators, this would not reset the TextReader. To solve your problem, you either need to buffer the enumerable (e.g. with ToList) or find a way to reset the TextReader.
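As a minimal sketch (reusing Map, Valid and LoadSqlRecord from your code; reader stands in for the TextReader over your file), buffering the results once before the emptiness check avoids consuming the shared TextReader twice:
// Sketch only: materialize the mapped rows so the emptiness check and the
// later enumeration inside Import do not both advance the same TextReader.
var records = Map(reader)
    .Where(x => Valid(x))
    .Select(x => LoadSqlRecord(x))
    .ToList();

if (records.Count == 0)
    return; // nothing valid to import, and no rows have been lost

Import(records);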
Related
What I'm trying to do is safeguard my C# data retrieval code from IndexOutOfRangeException when using datareader.GetOrdinal(). This is no problem if my procedures don't change and return just one result set. But if they return multiple result sets, then I need to iterate the result sets with .NextResult(). That is no problem either.
But what if somebody changes the procedure to have another select statement so that the order of my C# retrieval code changes and everything blows up?
But the question is: Can I check if the result is the result that I want?
Below is the pseudo code for what I'd like to do.
using (SqlDataReader reader = cmd.ExecuteReader())
{
    //The check I would like to be able to do
    if (reader.Result != "The result with someColumnName")
    {
        //This is not the result I'm looking for, so I try the next one
        reader.NextResult();
    }
    else //Get the result set I want.. If it blows up now it should..
    {
        if (reader.HasRows)
        {
            //Get all ordinals first. Faster than looking up by name each time.
            int someColumnNameOrdinal = reader.GetOrdinal("someColumnName");
            while (reader.Read())
            {
                var someValue = reader.GetString(someColumnNameOrdinal);
            }
        }
    }
}
I know that I could try GetOrdinal, get an exception, catch it and then try the next result, but that is just too damn unclean (and wrong).
Using an OR/M is a good practice, but there are exceptions, of course. For example, you may be forced to query a SQL server which does not support stored procs (SQL Server Compact etc.). To increase performance you may opt to use SqlDataReader. To ensure that your field names are always correct (i.e. correspond to your entity class fields) you may use the following practice: instead of hard-coding field names, use this code:
GetPropertyName((YourEntityClass c) => c.YourField)
Where GetPropertyName function contains the following code (generics are used):
public static string GetPropertyName<T, TReturn>(System.Linq.Expressions.Expression<Func<T, TReturn>> expression)
{
    System.Linq.Expressions.MemberExpression body = (System.Linq.Expressions.MemberExpression)expression.Body;
    return body.Member.Name;
}
YourEntityClass is a class name for your table in Entity Framework, YourField is a field name in this table. In this case you get the performance of SqlDataReader and the safety of Entity Framework at the same time.
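Putting it together, a call site might look roughly like this (a sketch; YourEntityClass and YourField are placeholders for your own types, and reader is an open SqlDataReader):
// Sketch: resolve the column name from the entity property instead of a
// hard-coded string, so a renamed property breaks at compile time.
int ordinal = reader.GetOrdinal(GetPropertyName((YourEntityClass c) => c.YourField));
var value = reader.GetValue(ordinal);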
Assuming each of the result sets has a uniquely named first column, for the following line in your pseudo code example:
if(reader.Result != "The result with someColumnName")
you could use the SqlDataReader.GetName function to check the name of the first column. For example:
if(reader.GetName(0) != "ExpectedColumnName")
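Folding that into your pseudo code, the loop might look roughly like this (a sketch, assuming "someColumnName" is the uniquely named first column of the result set you want):
// Sketch: advance through the result sets until the first column matches.
using (SqlDataReader reader = cmd.ExecuteReader())
{
    do
    {
        if (reader.FieldCount > 0 && reader.GetName(0) == "someColumnName")
        {
            int someColumnNameOrdinal = reader.GetOrdinal("someColumnName");
            while (reader.Read())
            {
                var someValue = reader.GetString(someColumnNameOrdinal);
            }
            break; // processed the result set we wanted
        }
    } while (reader.NextResult());
}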
I spend a lot of time querying a database and then building collections of objects from the query. For performance I tend to use a Datareader and the code looks something like:
while (rdr.Read())
{
    var myObj = new myObj();
    myObj.Id = Int32.Parse(rdr["Id"].ToString());
    //more populating of myObj from rdr
    myObj.Created = (DateTime)rdr["Created"];
}
For objects like DateTime I simply cast the rdr value to the required class, but this can't be done for value types like int, hence the (IMHO) laborious ToString() followed by Int32.Parse(...)
Of course there is an alternative:
myObj.Id = rdr.GetInt32(rdr.GetOrdinal("Id"));
which looks cleaner and doesn't involve a call to ToString().
A colleague and I were discussing this today - he suggests that accessing rdr twice in the above code might be less efficient than doing it my old skool way - could anyone confirm or deny this and suggest which of the above is the best way of doing this sort of thing? I would especially welcome answers from @JonSkeet ;-)
I doubt there will be a very appreciable performance difference, but you can avoid the name lookup on every row simply by lifting it out of the loop. This is probably the best you'll be able to achieve:
int idIdx = rdr.GetOrdinal("Id");
int createdIdx = rdr.GetOrdinal("Created");

while (rdr.Read())
{
    var myObj = new myObj();
    myObj.Id = rdr.GetFieldValue<int>(idIdx);
    //more populating of myObj from rdr
    myObj.Created = rdr.GetFieldValue<DateTime>(createdIdx);
}
I usually introduce a RecordSet class for this purpose:
public class MyObjRecordSet
{
    private readonly IDataReader InnerDataReader;
    private readonly int OrdinalId;
    private readonly int OrdinalCreated;

    public MyObjRecordSet(IDataReader dataReader)
    {
        this.InnerDataReader = dataReader;
        this.OrdinalId = dataReader.GetOrdinal("Id");
        this.OrdinalCreated = dataReader.GetOrdinal("Created");
    }

    public int Id
    {
        get
        {
            return this.InnerDataReader.GetInt32(this.OrdinalId);
        }
    }

    public DateTime Created
    {
        get
        {
            return this.InnerDataReader.GetDateTime(this.OrdinalCreated);
        }
    }

    public MyObj ToObject()
    {
        return new MyObj
        {
            Id = this.Id,
            Created = this.Created
        };
    }

    public static IEnumerable<MyObj> ReadAll(IDataReader dataReader)
    {
        MyObjRecordSet recordSet = new MyObjRecordSet(dataReader);
        while (dataReader.Read())
        {
            yield return recordSet.ToObject();
        }
    }
}
Usage example:
List<MyObj> myObjects = MyObjRecordSet.ReadAll(rdr).ToList();
This makes the most sense to a reader. Whether it's the most "efficient" is debatable: you're literally calling two functions instead of one, but that's not going to be as significant as casting and then calling a function. Ideally you should go with the option that reads better, provided it doesn't hurt your performance.
var ordinal = rdr.GetOrdinal("Id");
var id = rdr.GetInt32(ordinal);
myObj.Id = id;
Actually there are differences in performance in how you use SqlDataReader, but they are somewhere else. Namely, the ExecuteReader method accepts the CommandBehavior.SequentialAccess flag:
Provides a way for the DataReader to handle rows that contain columns with large binary values. Rather than loading the entire row, SequentialAccess enables the DataReader to load data as a stream. You can then use the GetBytes or GetChars method to specify a byte location to start the read operation, and a limited buffer size for the data being returned.
When you specify SequentialAccess, you are required to read from the columns in the order they are returned, although you are not required to read each column. Once you have read past a location in the returned stream of data, data at or before that location can no longer be read from the DataReader. When using the OleDbDataReader, you can reread the current column value until reading past it. When using the SqlDataReader, you can read a column value only once.
If you do not use large binary values then it makes very little difference. Getting a string and parsing it is suboptimal, true; it is better to get the value with rdr.GetSqlInt32(column) rather than GetInt32() because of NULL. But the difference should not be noticeable in most applications, unless your app is truly doing nothing else but read huge datasets. Most apps do not behave that way. Focusing on optimising the database call itself (i.e. having the query execute fast) will reap far greater benefits 99.9999% of the time.
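As a rough sketch of the streaming scenario the documentation describes (the command and the two-column layout below are assumptions, not taken from the question):
// Sketch: stream a large binary column with SequentialAccess.
// Columns must be read in order; here column 0 is an int key and
// column 1 is the large varbinary payload (hypothetical schema).
using (var reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess))
{
    var buffer = new byte[8192];
    while (reader.Read())
    {
        int id = reader.GetInt32(0);
        long offset = 0;
        long bytesRead;
        while ((bytesRead = reader.GetBytes(1, offset, buffer, 0, buffer.Length)) > 0)
        {
            // process buffer[0..bytesRead), e.g. write it to a stream
            offset += bytesRead;
        }
    }
}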
For objects like DateTime I simply cast the rdr value to the required class, but this can't be done for value types like int
This isn't true: DateTime is also a value type and both of the following work in the same way, provided the field is of the expected type and is not null:
myObj.Id = (int) rdr["Id"];
myObj.Created = (DateTime)rdr["Created"];
If it's not working for you, perhaps the field you're reading is NULL? Or not of the required type, in which case you need to cast twice. E.g. for a SQL NUMERIC field, you might need:
myObj.Id = (int) (decimal) rdr["Id"];
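If NULLs are the problem, a small guard along these lines (a sketch, using the same column names as above) avoids the cast blowing up:
// Sketch: check for DBNull before unboxing the value.
object raw = rdr["Created"];
if (raw != DBNull.Value)
    myObj.Created = (DateTime)raw;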
I use ExecuteReader().
It returns only the last result. I wanted to display the results like an array in tbid_1.Text, tbid_2.Text, tbid_3.Text etc.
public void Select(FrmVIRGO frm)
{
    string query = "SELECT * FROM tb_patient_information";

    if (this.OpenConnection() == true)
    {
        //Create Command
        MySqlCommand cmd = new MySqlCommand(query, connection);
        //Create a data reader and Execute the command
        MySqlDataReader dataReader = cmd.ExecuteReader();

        while (dataReader.Read())
        {
            // I think use like this
            frm.tbid_1.Text = dataReader["id_patient"][1].ToString(); //... id_patient1
            frm.tbid_2.Text = dataReader["id_patient"][2].ToString(); //... id_patient2
            frm.tbid_3.Text = dataReader["id_patient"][3].ToString(); //... id_patient3
        }

        //close Data Reader
        dataReader.Close();
        //close Connection
        this.CloseConnection();
    }
}
Your code appears to be expecting that once you've called dataReader.Read(), you can access all of the records by an index.
Your data reader is an instance of IDataReader, the interface that most .NET data access libraries use to represent the concept of "reading the results of a query". IDataReader only gives you access to one record at a time. Each time you call dataReader.Read(), the IDataReader advances to the next record. When it returns false, it means that you have reached the end of the result set.
For example, you could transform your code above to something like this:
dataReader.Read(); // dataReader is at 1st record
frm.tbid_1.Text = dataReader["id_patient"].ToString();
dataReader.Read(); // dataReader is at 2nd record
frm.tbid_2.Text = dataReader["id_patient"].ToString();
dataReader.Read(); // dataReader is at 3rd record
frm.tbid_3.Text = dataReader["id_patient"].ToString();
Note that this is not the way you should do it, I'm just using it to illustrate the way a DataReader works.
If you are expecting exactly 3 records to be returned, you could use something similar to the code above. I would modify it to verify that dataReader.Read() returns true before you read data from each record, however, and handle a case where it doesn't in a meaningful way (e.g., throw an exception that explains the error, log the error, etc.).
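For example, a defensive version might look roughly like this (the exception type and message are just placeholders):
// Sketch: fail loudly if fewer rows come back than the three text boxes expect.
var targets = new[] { frm.tbid_1, frm.tbid_2, frm.tbid_3 };
foreach (var textBox in targets)
{
    if (!dataReader.Read())
        throw new InvalidOperationException("Fewer patient rows were returned than expected.");
    textBox.Text = dataReader["id_patient"].ToString();
}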
Generally though, if I am working with raw ADO.Net (as opposed to using an OR/M) I prefer to convert each record in the IDataReader to a dictionary beforehand, and work with those.
For example, you could write the following extension method for DataReader:
public static class DataReaderExtensions
{
    public static IList<IDictionary<string, object>> ListRecordsAsDictionaries(this IDataReader reader)
    {
        var list = new List<IDictionary<string, object>>();
        while (reader.Read())
        {
            var record = new Dictionary<string, object>();
            for (var i = 0; i < reader.FieldCount; i++)
            {
                var key = reader.GetName(i);
                var value = reader[i];
                record.Add(key, value);
            }
            list.Add(record);
        }
        return list;
    }
}
This method iterates over the IDataReader, and sticks the values from each row into a Dictionary<string, object>. I have found this pattern to generally be fairly useful when dealing with raw ADO stuff.
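Usage would look something like this (a sketch; the column name is borrowed from your example and assumed to exist):
// Sketch: read every row up front, then work with plain dictionaries.
using (var dataReader = cmd.ExecuteReader())
{
    var records = dataReader.ListRecordsAsDictionaries();
    foreach (var record in records)
    {
        Console.WriteLine(record["id_patient"]);
    }
}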
This approach has a couple of caveats:
You don't get access to the records (the Dictionary instances) until all of the data has been received from the server. If you are handling records individually as the DataReader makes them available, you may actually be able to start processing the data while some of it is still in-transit. (Note: this could be fixed by making this method return IEnumerable<IDictionary<string, object>> instead of IList<IDictionary<string, object>>, and using yield return to yield each record as it becomes available; a sketch of that variant follows this list.)
If you are iterating over a large data set, you may not want to instantiate that many dictionaries. It may be better to just handle each record individually instead.
You lose access to some of the information that DataReader can provide about the record(s) (e.g., you can't use DataReader.GetDataTypeName as you are iterating over the records).
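For reference, the streaming variant mentioned in the first caveat could look roughly like this (a sketch, not tuned for your schema):
// Sketch: same idea, but each record is yielded as soon as it is read,
// so processing can start before the full result set has arrived.
public static IEnumerable<IDictionary<string, object>> EnumerateRecordsAsDictionaries(this IDataReader reader)
{
    while (reader.Read())
    {
        var record = new Dictionary<string, object>();
        for (var i = 0; i < reader.FieldCount; i++)
        {
            record.Add(reader.GetName(i), reader[i]);
        }
        yield return record;
    }
}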
I am transforming an Excel spreadsheet into a list of "Elements" (this is a domain term). During this transformation, I need to skip the header rows and throw out malformed rows that cannot be transformed.
Now comes the fun part. I need to capture those malformed records so that I can report on them. I constructed a crazy LINQ statement (below). These are extension methods hiding the messy LINQ operations on the types from the OpenXml library.
var elements = sheet
.Rows() <-- BEGIN sheet data transform
.SkipColumnHeaders()
.ToRowLookup()
.ToCellLookup()
.SkipEmptyRows() <-- END sheet data transform
.ToElements(strings) <-- BEGIN domain transform
.RemoveBadRecords(out discard)
.OrderByCompositeKey();
The interesting part starts at ToElements, where I transform the row lookup to my domain object list (details: it's called an ElementRow, which is later transformed into an Element). Bad records are created with just a key (the Excel row index) and are uniquely identifiable vs. a real element.
public static IEnumerable<ElementRow> ToElements(this IEnumerable<KeyValuePair<UInt32Value, Cell[]>> map)
{
    return map.Select(pair =>
    {
        try
        {
            return ElementRow.FromCells(pair.Key, pair.Value);
        }
        catch (Exception)
        {
            return ElementRow.BadRecord(pair.Key);
        }
    });
}
Then, I want to remove those bad records (it's easier to collect all of them before filtering). That method is RemoveBadRecords, which started like this...
public static IEnumerable<ElementRow> RemoveBadRecords(this IEnumerable<ElementRow> elements)
{
    return elements.Where(el => el.FormatId != 0);
}
However, I need to report the discarded elements! And I don't want to muddy my transform extension method with reporting. So, I went to the out parameter (taking into account the difficulties of using an out param in an anonymous block)
public static IEnumerable<ElementRow> RemoveBadRecords(this IEnumerable<ElementRow> elements, out List<ElementRow> discard)
{
    var temp = new List<ElementRow>();
    var filtered = elements.Where(el =>
    {
        if (el.FormatId == 0) temp.Add(el);
        return el.FormatId != 0;
    });
    discard = temp;
    return filtered;
}
And, lo! I thought I was hardcore and would have this working in one shot...
var discard = new List<ElementRow>();
var elements = data
    /* snipped long LINQ statement */
    .RemoveBadRecords(out discard)
    /* snipped long LINQ statement */;

discard.ForEach(el => failures.Add(el));

foreach (var el in elements)
{
    /* do more work, maybe add more failures */
}

return new Result(elements, failures);
But, nothing was in my discard list at the time I looped through it! I stepped through the code and realized that I successfully created a fully-streaming LINQ statement.
The temp list was created
The Where filter was assigned (but not yet run)
And the discard list was assigned
Then the streaming thing was returned
When discard was iterated, it contained no elements, because the elements weren't iterated over yet.
Is there a way to fix this problem using the thing I constructed? Do I have to force an iteration of the data before or during the bad record filter? Is there another construction that I've missed?
Some Commentary
Jon mentioned that the assignment /was/ happening. I simply wasn't waiting for it. If I check the contents of discard after the iteration of elements, it is, in fact, full! So, I don't actually have an assignment problem. Unless I take Jon's advice on what's good/bad to have in a LINQ statement.
When the statement was actually iterated, the Where clause ran and temp filled up, but discard was never assigned again!
It doesn't need to be assigned again - the existing list which will have been assigned to discard in the calling code will be populated.
However, I'd strongly recommend against this approach. Using an out parameter here is really against the spirit of LINQ. (If you iterate over your results twice, you'll end up with a list which contains all the bad elements twice. Ick!)
I'd suggest materializing the query before removing the bad records - and then you can run separate queries:
var allElements = sheet
.Rows()
.SkipColumnHeaders()
.ToRowLookup()
.ToCellLookup()
.SkipEmptyRows()
.ToElements(strings)
.ToList();
var goodElements = allElements.Where(el => el.FormatId != 0)
.OrderByCompositeKey();
var badElements = allElements.Where(el => el.FormatId == 0);
By materializing the query in a List<>, you only process each row once in terms of ToRowLookup, ToCellLookup etc. It does mean you need to have enough memory to keep all the elements at once, of course. There are alternative approaches (such as taking an action on each bad element while filtering it, sketched below) but they're still likely to end up being fairly fragile.
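A rough sketch of that side-effecting alternative (elementsQuery stands in for the original, un-materialized transform chain and failures is your existing list):
// Sketch: collect bad elements as a side effect of the filter instead of materializing.
// The side effect only runs when the query is iterated, and runs again on every
// re-iteration - which is exactly what makes this approach fragile.
var goodElements = elementsQuery.Where(el =>
{
    if (el.FormatId == 0) failures.Add(el);
    return el.FormatId != 0;
});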
EDIT: Another option as mentioned by Servy is to use ToLookup, which will materialize and group in one go:
var lookup = sheet
.Rows()
.SkipColumnHeaders()
.ToRowLookup()
.ToCellLookup()
.SkipEmptyRows()
.ToElements(strings)
.OrderByCompositeKey()
.ToLookup(el => el.FormatId == 0);
Then you can use:
foreach (var goodElement in lookup[false])
{
...
}
and
foreach (var badElement in lookup[true])
{
...
}
Note that this performs the ordering on all elements, good and bad. An alternative is to remove the ordering from the original query and use:
foreach (var goodElement in lookup[false].OrderByCompositeKey())
{
...
}
I'm not personally wild about grouping by true/false - it feels like a bit of an abuse of what's normally meant to be a key-based lookup - but it would certainly work.
This isn't related to a particular issue BUT is a question regarding "best practice".
For a while now, when I need to get data straight from the database I've been using the following method - I was wondering if there's a faster method which I don't know about?
DataTable results = new DataTable();

using (SqlConnection connection = new SqlConnection(ConfigurationManager.ConnectionStrings["Name"].ConnectionString))
{
    connection.Open();
    using (SqlCommand command = new SqlCommand("StoredProcedureName", connection))
    {
        command.CommandType = CommandType.StoredProcedure;
        /*Optionally set command.Parameters here*/
        results.Load(command.ExecuteReader());
    }
}
/*Do something useful with the results*/
There are indeed various ways of reading data; DataTable is quite a complex beast (with support for a number of complex scenarios - referential integrity, constraints, computed values, on-the-fly extra columns, indexing, filtering, etc). In a lot of cases you don't need all that; you just want the data. To do that, a simple object model can be more efficient, both in memory and performance. You could write your own code around IDataReader, but that is a solved problem, with a range of tools that do that for you. For example, you could do that via dapper with just:
class SomeTypeOfRow { // define something that looks like the results
    public int Id { get; set; }
    public string Name { get; set; }
    //..
}
...
var rows = connection.Query<SomeTypeOfRow>("StoredProcedureName",
/* optionalParameters, */ commandType: CommandType.StoredProcedure).ToList();
which then very efficiently populates a List<SomeTypeOfRow>, without all the DataTable overheads. Additionally, if you are dealing with very large volumes of data, you can do this in a fully streaming way, so you don't need to buffer 2M rows in memory:
var rows = connection.Query<SomeTypeOfRow>("StoredProcedureName",
/* optionalParameters, */ commandType: CommandType.StoredProcedure,
buffered: false); // an IEnumerable<SomeTypeOfRow>
For completeness, I should explain optionalParameters; if you wanted to pass @id=1, @name="abc", that would be just:
var rows = connection.Query<SomeTypeOfRow>("StoredProcedureName",
new { id = 1, name = "abc" },
commandType: CommandType.StoredProcedure).ToList();
which is, I think you'll agree, a pretty concise way of describing the parameters. This parameter is entirely optional, and can be omitted if no parameters are required.
As an added bonus, it means you get strong-typing for free, i.e.
foreach (var row in rows) {
    Console.WriteLine(row.Id);
    Console.WriteLine(row.Name);
}
rather than having to talk about row["Id"], row["Name"] etc.