Is there a faster way to access objects exposed by VSTO/Excel - c#

I'm writing a lot of Office add-ins in C#, and I love all the wonderful ways you can extend the functionality of especially Excel. But one thing that keeps bugging me is the overhead of doing pretty much anything to pretty much any Office object.
I'm aware that there are high-level tricks to doing many things faster, like reading and writing object[,] arrays to larger cell ranges instead of accessing individual cells, and so on. But regardless, a complicated add-in will always end up accessing lots of different objects, or many properties of a few objects, or the same properties over and over again.
And when profiling my add-ins I always find I spend at least 90% of my CPU time accessing basic properties of Office objects. For instance, here is a bit of code I use to check if a window has been scrolled, so I can update some overlay graphics accordingly:
    Excel.Window window = Globals.ThisAddIn.Application.ActiveWindow;
    if (window.ScrollColumn != previousScrollColumn)
    {
        needsRedraw = true;
        previousScrollColumn = window.ScrollColumn;
    }
    if (window.ScrollRow != previousScrollRow)
    {
        needsRedraw = true;
        previousScrollRow = window.ScrollRow;
    }
    if (window.Zoom != previousZoom)
    {
        needsRedraw = true;
        previousZoom = window.Zoom;
    }
The first line, getting the active window, and each of the if statements, each accessing a property of that window, all light up when profiling. They're really slow.
Now I know these are COM objects in managed wrappers, and there's some sort of managed->unmanaged interface stuff going on, probably inter-process communication and whatnot, so I'm not surprised that there's some overhead, but I'm still amazed at how much it adds up.
So are there any tricks for speeding stuff like this up?
For instance, in the above case I'm accessing three properties of the same object. I can't help but think there must be some way to read them all in one go, like maybe via a native companion add-in or something...?
Any ideas?

If you can get the Open XML, you can load it and traverse it using the Open XML SDK or other related libraries. Word has this (Range.WordOpenXML) but I don't know if Excel does. Even then, it might be that not all properties are exposed, for example the scroll location is probably not there.
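Separately from the Open XML route, the snippet in the question reads each changed property twice: once for the comparison and once for the assignment. A minimal sketch of reading every property exactly once per check is shown below. It does not batch the calls into a single round trip and does not address the cost of fetching ActiveWindow; it only caps the property reads at three per check. IScrollSource and CountingWindow are hypothetical stand-ins so the sketch compiles without the Office interop assemblies, and the real Zoom property may need a conversion to a number before it can be stored this way.

```csharp
using System;

// Hypothetical stand-in for the three Excel.Window properties used in the question.
public interface IScrollSource
{
    int ScrollColumn { get; }
    int ScrollRow { get; }
    double Zoom { get; }
}

// Immutable snapshot of the view; built from exactly one read per property.
public readonly struct ViewState : IEquatable<ViewState>
{
    public int ScrollColumn { get; }
    public int ScrollRow { get; }
    public double Zoom { get; }

    public ViewState(int scrollColumn, int scrollRow, double zoom)
    {
        ScrollColumn = scrollColumn;
        ScrollRow = scrollRow;
        Zoom = zoom;
    }

    public static ViewState Capture(IScrollSource window) =>
        new ViewState(window.ScrollColumn, window.ScrollRow, window.Zoom);

    public bool Equals(ViewState other) =>
        ScrollColumn == other.ScrollColumn && ScrollRow == other.ScrollRow && Zoom == other.Zoom;

    public override bool Equals(object obj) => obj is ViewState other && Equals(other);

    public override int GetHashCode() => (ScrollColumn, ScrollRow, Zoom).GetHashCode();
}

public sealed class RedrawTracker
{
    private ViewState previous;

    // Returns true when the view differs from the one seen on the previous call.
    public bool Update(IScrollSource window)
    {
        ViewState current = ViewState.Capture(window);
        bool changed = !current.Equals(previous);
        previous = current;
        return changed;
    }
}

// Hypothetical test double that counts how often its properties are read.
public sealed class CountingWindow : IScrollSource
{
    private int column, row;
    private double zoom;

    public int Reads { get; private set; }

    public int ScrollColumn { get { Reads++; return column; } set { column = value; } }
    public int ScrollRow { get { Reads++; return row; } set { row = value; } }
    public double Zoom { get { Reads++; return zoom; } set { zoom = value; } }
}
```

With the real add-in, the tracker would be fed a thin adapter over Excel.Window, and needsRedraw would simply be the return value of Update.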

Related

Applying business rules to an object in a SOLID way

I've been driving myself crazy for hours trying to figure this one out, and I'm not moving anywhere with it.
I'm creating a checkout till for a cashier, specifically I need to sum the items, then apply the promotional discounts. I'm trying to do it without violating any design principles (impossible, I know, I can let things slide when it makes sense).
Promotional discounts could be anything, from a flat Black Friday discount, to 'Orders over £100 save 10%', to '3 for 2 on these items', or 'Buy at least two cans of Coke, and the price for them all drops to £0.50!'
I cannot see how to fit the promotional deals in. Each may require a different set of data from different locations. For instance one of the big problems being the '3 for 2' deal. Getting access to the items in the Checkout has been a plague on my mind.
So far, my best approach has been to use the Decorator pattern: wrap the checkout in a bunch of promotional deals when the price is calculated. As each decorator holds an instance of the checkout, we'll have access to the original checkout with the list of items.
In the future, the only thing I'd need to do is write the new rule, add it to the factory, and update any DB data which is perfect, the minimal change.
This kind of works. I can justify it in my head that it's still a checkout, and therefore all the rules being able to access the checkout make sense, gives me a nice way to chain the discount. But there is a problem, in that I'm sure I shouldn't be using it the way I'm suggesting. For instance, if one of the promotions wears off, you shouldn't really 'unwrap' it, and realistically while it's nice to be able to add promotions dynamically to extend an instance, it's not necessary.
I've read through more design patterns but can't seem to find anything that applies. I saw the following article:
https://levelup.gitconnected.com/rules-design-pattern-in-c-6c62f0e20ee0
This is basically what I want to do, but the implementation feels clumsy to me.
    public bool IsValid(FileInfo fileInfo)
    {
        var rules = new List<IFileValidationRule> { new FileExtensionRule(new string[] { "txt", "html" }) };
        if (AdminConfig.CheckFileSize)
        {
            rules.Add(new FileSizeRule("txt", 5 * 1024 * 1024));
            rules.Add(new FileSizeRule("html", 10 * 1024 * 1024));
        }
        if (User.Status != UserStatus.Premium)
        {
            rules.Add(new MaxFileLengthRule(50));
        }
        bool isValid = rules.All(rule => rule.IsValid(fileInfo));
        return isValid;
    }
Specifically, this part seems to violate a few key principles: Open-Closed, Dependency Inversion, etc.
The other big problem I can't wrap my head around is as below:
Imagine that, for the above example, a new rule needs to be added that reads the file data and checks whether there are any bad characters in there; the details don't matter.
Implementing this is easy, you inject the file or the file data, whatever into the 'FileValidator' class, you then instantiate your rule and pass the file data into it. You then run the rule and return the success, great! But is this ok?
Reading this says no: http://wiki.c2.com/?TellDontAsk
"Tell, don't ask" - It is okay to use accessors to get the state of an object, as long as you don't use the result to make decisions outside the object.
That would be exactly what the code is doing! I guess the alternative to this is to update the 'FileData' object to essentially take a list of bad characters, check the file data, return false to the rule, which then fails the whole process, but this would start to throw a bunch more rules out the door. You're now breaking the Open-Closed principle, Single Responsibility principle, and it feels like you're building a rod for your own back, adding these custom methods for singular rules, bloating your object. (The link does discuss how you can pass a function into the method, which is pretty nice, but still not perfect, at the end of the day, aren't you just indirectly handing control to the caller?)
The above alone wouldn't be enough to stop me, but I'm struggling to justify making a private set of items public so one rule out of the bunch can make use of that data.
I'm in OOP recursion, tumbling towards a stack overflow. Can anyone pull me out and help me consolidate my thoughts? None of the design patterns seem to work but I'm sure this is a basic problem solved many times in the past. What am I missing?
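One way to picture the rules idea applied to the checkout, without decorators and without making the item list public, is for the checkout to hand each promotion a read-only view of its items and ask only for a discount amount. The sketch below is an illustration under stated assumptions, not a recommended final design: every type name (LineItem, IPromotion, PercentOverThreshold, ThreeForTwo, Checkout) is hypothetical, and every promotion is computed against the undiscounted subtotal rather than chained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical basket line: one SKU, its unit price and a quantity.
public sealed class LineItem
{
    public string Sku { get; }
    public decimal UnitPrice { get; }
    public int Quantity { get; }

    public LineItem(string sku, decimal unitPrice, int quantity)
    {
        Sku = sku;
        UnitPrice = unitPrice;
        Quantity = quantity;
    }
}

// A promotion only reads the basket and reports how much to take off.
public interface IPromotion
{
    decimal DiscountFor(IReadOnlyList<LineItem> items, decimal subtotal);
}

// "Orders over a threshold save a percentage."
public sealed class PercentOverThreshold : IPromotion
{
    private readonly decimal threshold;
    private readonly decimal rate;

    public PercentOverThreshold(decimal threshold, decimal rate)
    {
        this.threshold = threshold;
        this.rate = rate;
    }

    public decimal DiscountFor(IReadOnlyList<LineItem> items, decimal subtotal) =>
        subtotal > threshold ? subtotal * rate : 0m;
}

// "3 for 2" on one SKU: every third unit is free.
// Assumes all lines for the SKU share the same unit price.
public sealed class ThreeForTwo : IPromotion
{
    private readonly string sku;

    public ThreeForTwo(string sku) { this.sku = sku; }

    public decimal DiscountFor(IReadOnlyList<LineItem> items, decimal subtotal)
    {
        List<LineItem> matching = items.Where(i => i.Sku == sku).ToList();
        if (matching.Count == 0) return 0m;
        int units = matching.Sum(i => i.Quantity);
        return (units / 3) * matching[0].UnitPrice;
    }
}

public sealed class Checkout
{
    private readonly List<LineItem> items = new List<LineItem>();
    private readonly List<IPromotion> promotions;

    public Checkout(IEnumerable<IPromotion> promotions)
    {
        this.promotions = promotions.ToList();
    }

    public void Add(LineItem item) => items.Add(item);

    public decimal Total()
    {
        decimal subtotal = items.Sum(i => i.UnitPrice * i.Quantity);
        // The items stay private; promotions only receive a read-only wrapper.
        decimal discount = promotions.Sum(p => p.DiscountFor(items.AsReadOnly(), subtotal));
        return Math.Max(0m, subtotal - discount);
    }
}
```

Adding a promotion then means writing one new IPromotion class and registering it wherever the list is built, which keeps Checkout itself unchanged. Whether discounts should stack, chain or exclude each other is a business decision that this sketch deliberately leaves open.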

Releasing COM objects in .NET from within enumerations and other reference scenarios

I have a WinForms application that uses COM Interop to connect to Microsoft Office applications. I have read a great deal of material regarding how to properly dispose of COM objects and here is typical code from my application using techniques from Microsoft's own article (here):
    Excel.Application excel = new Excel.Application();
    Excel.Workbook book = excel.Workbooks.Add();
    Excel.Range range = null;
    foreach (Excel.Worksheet sheet in book.Sheets)
    {
        range = sheet.Range["A2:Z2"];
        // Process [range] here.
        range.Merge(); // MergeCells is a property; Merge() is the method.
        System.Runtime.InteropServices.Marshal.ReleaseComObject(range);
        range = null;
    }
    // Release explicitly declared objects in hierarchical order.
    System.Runtime.InteropServices.Marshal.ReleaseComObject(book);
    System.Runtime.InteropServices.Marshal.ReleaseComObject(excel);
    book = null;
    excel = null;
    // As taken from:
    // http://msdn.microsoft.com/en-us/library/aa679807(v=office.11).aspx.
    System.GC.Collect();
    System.GC.WaitForPendingFinalizers();
    System.GC.Collect();
    System.GC.WaitForPendingFinalizers();
All exception handling has been stripped to make the code clearer for this question.
What happens to the [sheet] object in the [foreach] loop? Presumably, it will not get cleaned up, nor can we tamper with it while it is being enumerated. One alternative would be to use an indexing loop but that makes for ugly code and some constructs in the Office Object Libraries do not even support indexing.
Also, the [foreach] loop references the collection [book.Sheets]. Does that leave orphaned RCW counts as well?
So two questions here:
What is the best approach to clean up when enumerating is necessary?
What happens to the intermediate objects like [Sheets] in [book.Sheets] since they are not explicitly declared or cleaned up?
UPDATE:
I was surprised by Hans Passant's suggestion and felt it necessary to provide some context.
This is a client/server application where the client connects to many different Office apps, including Access, Excel, Outlook, PowerPoint and Word, among others. It has over 1,500 classes (and growing) that test for certain tasks being performed by end-users as well as simulate them in training mode. It is used to train and test students for Office proficiency in academic environments. With multiple developers and loads of classes, it has been difficult to enforce COM-friendly coding practices. I eventually resorted to creating automated tests using a combination of reflection and source code parsing to ensure the integrity of these classes at a pre-code-review stage.
Will give Hans' suggestion a try and report back.
Enumerating
Your sheet loop variable is, indeed, not being released. When writing interop code for Excel you have to constantly watch your RCWs. In preference to foreach enumerations I tend to use for, as having to declare the variable explicitly makes me notice whenever I've grabbed a reference. If you must enumerate, then at the end of the loop body (before you leave the loop) do this:
    if (Marshal.IsComObject(sheet)) {
        Marshal.ReleaseComObject(sheet);
    }
And, be careful of continue and break statements that leave the loop before you have released your reference.
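A minimal sketch of one way to make that release hard to forget is to move the loop into a helper that releases each element in a finally block, so an exception in the body cannot skip it. ComEnumeration and ForEachReleasing are hypothetical names, not part of any Office or .NET API; only Marshal.IsComObject and Marshal.ReleaseComObject are real framework calls. The helper does not release the collection being enumerated, which still has to be held in a variable and released by the caller.

```csharp
using System;
using System.Collections;
using System.Runtime.InteropServices;

public static class ComEnumeration
{
    // Runs body for every element of source and releases each element that is a
    // COM wrapper, even when body throws.
    public static void ForEachReleasing<T>(IEnumerable source, Action<T> body)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (body == null) throw new ArgumentNullException(nameof(body));

        foreach (object item in source)
        {
            try
            {
                body((T)item);
            }
            finally
            {
                // IsComObject is false for ordinary managed objects, so those are left alone.
                if (item != null && Marshal.IsComObject(item))
                {
                    Marshal.ReleaseComObject(item);
                }
            }
        }
    }
}

// Intended use with the interop types (an assumption, not compiled here):
//
//     Excel.Sheets sheets = book.Sheets;
//     try
//     {
//         ComEnumeration.ForEachReleasing<Excel.Worksheet>(sheets, sheet => { /* work */ });
//     }
//     finally
//     {
//         Marshal.ReleaseComObject(sheets);
//     }
```

Because the body is a lambda, continue becomes return and break has no direct equivalent, so a loop that must stop early would need a variant whose body returns a bool.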
Intermediates
It depends on whether or not the intermediate is actually a COM object (book.Sheets is), but if it is then you need to first get a reference to it in a variable, then enumerate that reference, and then ensure you release it. Otherwise you are essentially "double dotting" (see below):
    using System;
    using System.Runtime.InteropServices;
    using xl = Microsoft.Office.Interop.Excel;
    ...
    public void DoStuff()
    {
        ...
        xl.Sheets sheets = book.Sheets; // hold the intermediate in a variable
        bool sheetsReleased = false;
        try
        {
            ...
            foreach (xl.Worksheet sheet in sheets) { ... try, catch and release sheet ... }
            Marshal.ReleaseComObject(sheets);
            sheetsReleased = true;
        }
        catch (Exception)
        {
            // Release before the exception leaves this method.
            if (!sheetsReleased) { Marshal.ReleaseComObject(sheets); }
            throw;
        }
    }
The above code is the general pattern (it gets lengthy if you type it in full so I have only focussed on the important parts)
What about errors?
Be fastidious in your use of try ... catch ... finally. finally does not always get called in the case of things like stack overflow, out of memory or security exceptions, so if you want to ensure you clean up, and don't leave phantom Excel instances open if your code crashes, then you must conditionally release references in the catch before exceptions are rethrown.
Therefore, inside every foreach or for loop, you also need to use try ... catch ... finally to make sure that the enumeration variable is released.
Double dotting
Also do not "double dot" (only use a single period in lines of code). Doing this in foreach is a common mistake that is easy for us to do. I still catch myself doing it if I've been off doing non-COM C# for a while as it is more and more common to chain periods together due to LINQ style expressions.
Examples of double dotting:
item.property.propertyIWant
item.SubCollection[0] (you are calling SubCollection and then calling an indexer property on that subcollection)
foreach x in y.SubCollection (essentially you are calling SubCollection.GetEnumerator, so you are "double dotting" again)
Phantom Excel
The big test, of course, is to see if Excel remains open in the task manager once your program exits. If it does then you probably left a COM reference open.
References
You say you have researched this heavily, but in case it helps, then a few of the references that I found helpful were:
The mapping between interface pointers and runtime callable wrappers (RCWs)
VSTO and COM Interop
ReleaseComObject (cbrumme)
Robust solutions
One of the above references mentions a helper he uses for foreach loops. Personally, if I'm doing more than a simple "script" project then I'll first spend time on developing a library specifically wrapping the COM objects for my scenario. I have a common set of classes now that I reuse, and I've found that the time invested in setting that up before doing anything else is more than recovered in not having to hunt down unclosed references later on. Automated testing is also essential to help with this and reaps rewards for any COM interop, not just Excel.
Each COM object, such as Sheet, will be wrapped in a class that implements IDisposable. It will expose properties such as Sheets which in turn has an indexer. Ownership is tracked all the way through, and at the end if you simply dispose of the master object, such as the WorkbookWrapper, then everything else gets disposed of internally. Adding a sheet, for instance, is tracked so a new sheet will also be disposed of.
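A much smaller relative of that wrapper library is a scope object that records every COM reference handed to it and releases them in reverse order when disposed. The sketch below is that simpler variant, not the per-object wrapper design described above. ComScope and Track are hypothetical names, and the optional release delegate exists only so the behaviour can be exercised without Office installed.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

public sealed class ComScope : IDisposable
{
    private readonly Stack<object> tracked = new Stack<object>();
    private readonly Action<object> release;
    private bool disposed;

    // release defaults to Marshal.ReleaseComObject for objects that are COM wrappers.
    public ComScope(Action<object> release = null)
    {
        this.release = release ?? ReleaseIfComObject;
    }

    // Records the reference and returns it, so a call can be wrapped inline.
    public T Track<T>(T comObject) where T : class
    {
        if (disposed) throw new ObjectDisposedException(nameof(ComScope));
        if (comObject != null) tracked.Push(comObject);
        return comObject;
    }

    // Releases in reverse order of acquisition (children before parents).
    public void Dispose()
    {
        if (disposed) return;
        disposed = true;
        while (tracked.Count > 0)
        {
            release(tracked.Pop());
        }
    }

    private static void ReleaseIfComObject(object o)
    {
        if (Marshal.IsComObject(o))
        {
            Marshal.ReleaseComObject(o);
        }
    }
}

// Intended use with the interop types (an assumption, not compiled here):
//
//     using (var scope = new ComScope())
//     {
//         Excel.Sheets sheets = scope.Track(book.Sheets);
//         Excel.Worksheet first = scope.Track((Excel.Worksheet)sheets[1]);
//         ...
//     }
```

The trade-off against full wrappers is that nothing forces callers to use Track, so a chained expression that skips it still leaks a reference.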
While this is not a bulletproof approach you can at least rely on it for 95 % of the use cases, and the other 5 % you are totally aware of and take care of in the code. Most importantly, it is tested and reusable once you have done it the first time.

Iterating through diff changes in LibGit2Sharp

What could be the best (as in performant, simple) way to iterate over TreeChanges in LibGit2Sharp?
If I access the .Patch property, I retrieve the full text of the changes. This is not quite enough for me... ideally I would like to be able to iterate over the diff lines, and per each line retrieve the status of the line (modified, added, deleted) and build my own output out of it.
Update:
Let's say I want to build my own diff output. What I'd like to do is to iterate over the changed lines, and during iteration I would check for the type of change (added, removed), and construct my output.
For example:
    var diff = "";
    foreach (LineChange line in changes) // Bogus class "LineChange"
    {
        if (line.Type == LineChange.TYPE_ADDED)
            diff += "+";
        else
            diff += "-";
        diff += line.Content;
        diff += "\n";
    }
The above is just a simple example of the kind of flexibility I'm looking for: to be able to go through the changes and run some logic depending on the type of each line change. The Patch property is already "built". One way would be to parse it, but it seems silly that the library first builds the output, and then I parse it... I'd rather use the building ingredients directly.
I need this kind of functionality so that I can display a visual diff of changes which involves far more code and logic than the simple example I gave above.
As far as I can see, this information is not exposed by libgit2sharp, but it's provided by libgit2 in the case of blob diffs (but not for tree diffs). The relevant code is in ContentChanges.cs, specifically in the constructor and in the LineCallback() method (the code for tree diffs is in TreeChanges.cs).
Because of this, I think you have two options:
Invoke the method git_diff_blobs(), that's used internally by ContentChanges, yourself, either using reflection (it's an internal method in NativeMethods), or by copying the PInvoke signature to your project. You will most likely also need Utf8Marshaler.
Modify the code of ContentChanges, so that it fits your needs. If you do this, it might make sense to create a pull request for that change, so that others could use it too.
@svick is right. It's not exposed.
It might be useful to open an issue/feature request to further discuss this topic. Indeed, exposing a full blown line based diffgram might not fit the current "grain" of the library. However, provided you can come up with a scenario/use case that would benefit most of the users, some research may be invested in order to widen the API.
Besides this option, there might be other solutions: post-process the currently produced patch against the previous version of the file
See this SO question for potential leads
Neil Fraser's "Diff Strategies" paper is also a great source of strategies and potential caveats regarding what a diff tool might aim at
DiffPlex, as a working visualization tool, might be inspirational as well
With some more work, one might even achieve something similar to the kind of visualization offered by the Perforce 4 viewer (image source: macworld.com).
Note: In order to ease this, it might be useful to expose in C# the libgit2 diffing options.
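As a rough illustration of the post-processing route, the sketch below walks unified-diff text line by line and classifies each hunk line as added, removed or context. It makes no LibGit2Sharp calls: it assumes the input is a patch string such as the one the .Patch property returns, and LineKind, PatchLine and PatchReader are hypothetical names. It tracks whether it is inside a hunk so that file headers such as "--- a/file" are not mistaken for removed lines.

```csharp
using System;
using System.Collections.Generic;

public enum LineKind { Context, Added, Removed }

public readonly struct PatchLine
{
    public LineKind Kind { get; }
    public string Content { get; }

    public PatchLine(LineKind kind, string content)
    {
        Kind = kind;
        Content = content;
    }
}

public static class PatchReader
{
    // Yields one PatchLine per content line found inside the hunks of a unified diff.
    public static IEnumerable<PatchLine> ReadLines(string patch)
    {
        if (patch == null) throw new ArgumentNullException(nameof(patch));

        bool inHunk = false;
        foreach (string raw in patch.Split('\n'))
        {
            // Simplification: a trailing '\r' is dropped, so CRLF content loses its CR.
            string line = raw.TrimEnd('\r');

            if (line.StartsWith("diff ", StringComparison.Ordinal))
            {
                inHunk = false; // a new file section starts; its headers follow
                continue;
            }
            if (line.StartsWith("@@", StringComparison.Ordinal))
            {
                inHunk = true; // hunk header; content lines follow
                continue;
            }
            if (!inHunk || line.Length == 0)
            {
                continue; // file headers, index lines, trailing empty split entry
            }

            switch (line[0])
            {
                case '+':
                    yield return new PatchLine(LineKind.Added, line.Substring(1));
                    break;
                case '-':
                    yield return new PatchLine(LineKind.Removed, line.Substring(1));
                    break;
                case ' ':
                    yield return new PatchLine(LineKind.Context, line.Substring(1));
                    break;
                // Anything else (such as "\ No newline at end of file") is not content.
            }
        }
    }
}
```

The loop from the question could then iterate over PatchReader.ReadLines(patch) and switch on Kind. This still parses output the library has already built, so it is a workaround rather than the direct access the question asks for.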

Is it possible to modify a ReadOnlyCollection using reflection

I'm dealing with an SDK that keeps references to every object it creates, as long as the main connection object is in scope. Creating a new connection object periodically results in other resource issues, and is not an option.
To do what I need to do, I must iterate through thousands of these objects (almost 100,000), and while I certainly don't keep references to these objects, the object model in the SDK I'm using does. This chews through memory and is dangerously close to causing OutOfMemoryExceptions.
These objects are stored in nested ReadOnlyCollections, so what I'm trying now, is to use reflection to set some of these collections to null when I'm done with them, so the garbage collector can harvest the used memory.
    foreach (Build build in builds)
    {
        BinaryFileCollection numBinaries = build.GetBinaries();
        foreach (BinaryFile binary in numBinaries)
        {
            this.CoveredBlocks += binary.HitBlockCount;
            this.TotalBlocks += binary.BlockCount;
            this.CoveredArcs += binary.HitArcCount;
            this.TotalArcs += binary.ArcCount;
            if (binary.HitBlockCount > 0)
            {
                this.CoveredSourceFiles++;
            }
            this.TotalSourceFiles++;
            foreach (Class coverageClass in binary.GetClasses())
            {
                if (coverageClass.HitBlockCount > 0)
                {
                    this.CoveredClasses++;
                }
                this.TotalClasses++;
                foreach (Function function in coverageClass.GetFunctions())
                {
                    if (function.HitBlockCount > 0)
                    {
                        this.CoveredFunctions++;
                    }
                    this.TotalFunctions++;
                }
            }
            FieldInfo fi = typeof(BinaryFile).GetField("classes", BindingFlags.NonPublic | BindingFlags.Instance);
            fi.SetValue(binary, null);
        }
    }
When I check the values of the classes member in numBinaries[0], it returns null, which seems like mission accomplished, but when I run this code the memory consumption just keeps going up and up, just as fast as when I don't set classes to null at all.
What I'm trying to figure out is whether there's something intrinsically flawed in this approach, or if there's another object keeping references to the classes ReadOnlyCollection that I'm missing.
I can think of a few alternatives...
Logically split it out. You mentioned it keeps all the references "for the duration of a connection". Can you do 10%, close it, open a new one, skip that 10%, take another 10% (total 20%), etc?
How much memory are we talking about here, and is this tool going to be something that is long-lived? So what if it uses a lot of RAM for a few minutes? Are you actually getting OOMs? If your system has that much available RAM for the program to use, why not use it? You paid for the RAM. This reminds me of one of Raymond Chen's blog posts about 100% CPU consumption.
If you really want to see what is keeping something from getting garbage collected, firing up SOS and using !gcroot is a place to start.
But despite all of that, if this really is a problem, I would spend more time with the 3rd party API provider - at some point they may release an update you want that breaks this - and you'll be back to square one, or worse you can introduce subtle bugs in the product.

Is it true I should not do "long running" things in a property accessor?

And if so, why?
and what constitutes "long running"?
Doing magic in a property accessor seems like my prerogative as a class designer. I always thought that is why the designers of C# put those things in there - so I could do what I want.
Of course it's good practice to minimize surprises for users of a class, and so embedding truly long running things (e.g., a 10-minute Monte Carlo analysis) in a method makes sense.
But suppose a prop accessor requires a db read. I already have the db connection open. Would db access code be "acceptable", within the normal expectations, in a property accessor?
Like you mentioned, it's a surprise for the user of the class. People are used to being able to do things like this with properties (contrived example follows:)
    foreach (var item in bunchOfItems)
        foreach (var slot in someCollection)
            slot.Value = item.Value;
This looks very natural, but if item.Value actually is hitting the database every time you access it, it would be a minor disaster, and should be written in a fashion equivalent to this:
    foreach (var item in bunchOfItems)
    {
        var temp = item.Value;
        foreach (var slot in someCollection)
            slot.Value = temp;
    }
Please help steer people using your code away from hidden dangers like this, and put slow things in methods so people know that they're slow.
There are some exceptions, of course. Lazy-loading is fine as long as the lazy load isn't going to take some insanely long amount of time, and sometimes making things properties is really useful for reflection- and data-binding-related reasons, so maybe you'll want to bend this rule. But there's not much sense in violating the convention and violating people's expectations without some specific reason for doing so.
In addition to the good answers already posted, I'll add that the debugger automatically displays the values of properties when you inspect an instance of a class. Do you really want to be debugging your code and have database fetches happening in the debugger every time you inspect your class? Be nice to the future maintainers of your code and don't do that.
Also, this question is extensively discussed in the Framework Design Guidelines; consider picking up a copy.
A db read in a property accessor would be fine - that's actually the whole point of lazy-loading. I think the most important thing would be to document it well so that users of the class understand that there might be a performance hit when accessing that property.
You can do whatever you want, but you should keep the consumers of your API in mind. Accessors and mutators (getters and setters) are expected to be very light weight. With that expectation, developers consuming your API might make frequent and chatty calls to these properties. If you are consuming external resources in your implementation, there might be an unexpected bottleneck.
For consistency's sake, it's good to stick with convention for public APIs. If your implementations will be exclusively private, then there's probably no harm (other than an inconsistent approach to solving problems privately versus publicly).
It is simply good practice not to write property accessors that take a long time to execute.
That's because properties look like fields to the caller, and hence the caller (that is, a user of your API) usually assumes there is nothing more behind them than a "return something;"
If you really need some "action" behind the scenes, consider creating a method for that...
I don't see what the problem is with that, as long as you provide XML documentation so that the Intellisense notifies the object's consumer of what they're getting themselves into.
I think this is one of those situations where there is no one right answer. My motto is "Saying always is almost always wrong." You should do what makes the most sense in any given situation without regard to broad generalizations.
A database access in a property getter is fine, but try to limit the amount of times the database is hit through caching the value.
There are many times that people use properties in loops without thinking about the performance, so you have to anticipate this use. Programmers don't always store the value of a property when they are going to use it many times.
Cache the value returned from the database in a private variable, if it is feasible for this piece of data. This way the accesses are usually very quick.
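A minimal sketch of that caching idea uses Lazy<T> so the expensive load runs at most once per cached value, with the reload exposed as an explicit method. UserProfile and the loadUserName delegate are hypothetical; the delegate stands in for whatever database call the property would otherwise make.

```csharp
using System;

public sealed class UserProfile
{
    private readonly Func<string> loadUserName;
    private Lazy<string> userName;

    public UserProfile(Func<string> loadUserName)
    {
        this.loadUserName = loadUserName ?? throw new ArgumentNullException(nameof(loadUserName));
        userName = new Lazy<string>(loadUserName);
    }

    // Cheap after the first access: the loader is not called again until Refresh.
    public string UserName => userName.Value;

    // The potentially slow reload is requested through a method, so the cost is visible.
    // The new value is loaded on the next access to UserName.
    public void Refresh() => userName = new Lazy<string>(loadUserName);
}
```

Used in a loop, repeated reads of UserName then cost one field access each. Swapping the Lazy<string> instance in Refresh is not synchronized, so a class shared between threads would need additional care.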
This isn't directly related to your question, but have you considered going with a load once approach in combination with a refresh parameter?
    class Example
    {
        private bool userNameLoaded = false;
        private string userName = "";

        // Pass refresh: true to force a reload from the database.
        public string UserName(bool refresh)
        {
            // Only invalidate the cache; never mark it as loaded here.
            if (refresh)
            {
                userNameLoaded = false;
            }
            return UserName();
        }

        public string UserName()
        {
            if (!userNameLoaded)
            {
                /*
                userName = SomeDBMethod();
                */
                userNameLoaded = true;
            }
            return userName;
        }
    }
