I'm trying to parse through e-mails in Outlook 2007. I need it to run as fast as possible and I seem to be having some trouble.
Basically it's:
foreach (Folder fld in outlookApp.Session.Folders)
{
    foreach (MailItem mailItem in fld.Items)
    {
        string body = mailItem.Body;
    }
}
and for 5000 e-mails, this takes over 100 seconds. It doesn't seem to me like this should be taking anywhere near this long.
If I add:
string entry = mailItem.EntryID;
It ends up being an extra 30 seconds.
I'm doing all sorts of string manipulations, including regular expressions, with these strings and writing out to a database, and still those two lines take 50% of my runtime.
I'm using Visual Studio 2008.
Doing this kind of thing will take a long time, as you have to pull the data from the Exchange store for each item.
I think you have a couple of options here:
Process this information out of band using CDO/RDO in some other process.
Or
Use MAPI tables, as this is the fastest way to get properties. There are caveats with this, though, and you should check whether the things you do in your processing can be brought into a table.
Redemption wrapper - http://www.dimastr.com/redemption/mapitable.htm
MAPI Tables http://msdn.microsoft.com/en-us/library/cc842056.aspx
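If bringing in Redemption isn't an option, Outlook 2007's own object model exposes a similar table-based view through Folder.GetTable, which avoids opening a full MailItem per message. A rough sketch (default Inbox assumed, and outlookApp is the variable from the question; large properties such as Body are not available through a table, so items still have to be opened for those):
using Outlook = Microsoft.Office.Interop.Outlook;

// Read lightweight rows instead of full MailItem objects.
Outlook.Folder inbox = (Outlook.Folder)outlookApp.Session.GetDefaultFolder(Outlook.OlDefaultFolders.olFolderInbox);
Outlook.Table table = inbox.GetTable("", Outlook.OlTableContents.olUserItems);
// Default columns include EntryID and Subject; more can be added via table.Columns.Add.

while (!table.EndOfTable)
{
    Outlook.Row row = table.GetNextRow();
    string entryId = (string)row["EntryID"];
    string subject = (string)row["Subject"];
    // Open the full item (expensive) only for the rows where the body is actually needed:
    // var mail = (Outlook.MailItem)outlookApp.Session.GetItemFromID(entryId, Type.Missing);
}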
I do not know if this will address your specific issue, but the latest Office 2007 service pack made a significant performance improvement in Outlook with large numbers of messages.
Are you just reading in those strings in this loop, or are you reading in a string, processing it, then moving on to the next? You could try reading all the messages into a HashTable inside your loop then process them after they've been loaded--it might buy you some gains.
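A sketch of that read-first-then-process idea against the loop from the question (ProcessBody is a hypothetical stand-in for the regex and database work):
var bodies = new List<string>();
foreach (Folder fld in outlookApp.Session.Folders)
{
    foreach (MailItem mailItem in fld.Items)
    {
        bodies.Add(mailItem.Body);   // one store round-trip per item, nothing else
    }
}

// All of the heavy string/regex/database work happens outside the COM loop.
foreach (string body in bodies)
{
    ProcessBody(body);   // hypothetical stand-in for the existing processing
}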
Any kind of UI updates are extremely expensive; if you're writing out text or incrementing a progress bar it's best to do so sparingly.
We had exactly the same problem even when the folders were local and there was no network delay.
We got a 10x speedup by storing a copy of every email in a local SQL Server CE table tuned for the search we needed. We also used update events to make sure the local database remains in sync with the Outlook/Exchange folders.
To totally eliminate user lag we took the search out of the Outlook thread and put it in its own thread. The perception of lagging was worse than the actual delay it seems.
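For the sync part, the Outlook object model raises ItemAdd on a folder's Items collection; a minimal sketch of the wiring (SaveToLocalCache is a hypothetical helper, and the Items reference must be kept alive or the subscription is lost):
// using Outlook = Microsoft.Office.Interop.Outlook;
Outlook.Items inboxItems;   // field: keep a reference so the event subscription survives garbage collection

void HookInbox(Outlook.Application app)
{
    Outlook.Folder inbox = (Outlook.Folder)app.Session.GetDefaultFolder(Outlook.OlDefaultFolders.olFolderInbox);
    inboxItems = inbox.Items;
    inboxItems.ItemAdd += new Outlook.ItemsEvents_ItemAddEventHandler(OnItemAdd);
}

void OnItemAdd(object item)
{
    var mail = item as Outlook.MailItem;
    if (mail == null) return;
    // Insert or refresh this message's row in the local SQL Server CE copy.
    SaveToLocalCache(mail.EntryID, mail.Subject, mail.Body);   // hypothetical helper
}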
I encountered a similar situation while trying to access Outlook mails via VBA (in Excel).
However, it was far slower in my case: 1 e-mail per second! (Maybe it was slower for me than in your case because I had implemented it in VBA.)
Anyway, I successfully managed to improve the speed by using SetColumns (e.g. https://learn.microsoft.com/en-us/office/vba/api/Outlook.Items.SetColumns).
I know, I know... this only works for a few properties, like "Subject" and "ReceivedTime", and not for the body!
But think again: do you really want to read through the body of all your emails? Or is it just a subset, maybe based on the 'Subject' line or 'ReceivedTime'?
My requirement was to just go into the body of the email in case its subject matched a specific string!
Hence, I did the below:
I added a second 'Outlook.Items' object called 'myFilterItemCopyForBody' and applied the same filter I had on the other 'Outlook.Items'.
So now I have two 'Outlook.Items' objects, 'myFilterItem' and 'myFilterItemCopyForBody', both holding the same e-mail items since the same Restrict conditions are applied to both.
'myFilterItem'- to hold only 'Subject' and 'ReceivedTime' properties of the relevant mails (done by using SetColumns)
'myFilterItemCopyForBody'- to hold all the properties of the mail(including Body)
Now, both 'myFilterItem' and 'myFilterItemCopyForBody' are sorted with 'ReceivedTime' to have them in the same order.
Once sorted, both are looped over simultaneously in a nested For Each loop, picking corresponding properties (with the help of a counter) as in the code below.
Dim myFilterItem As Outlook.Items
Dim myFilterItemCopyForBody As Outlook.Items
Dim myItems As Outlook.Items

Set myItems = olFldr.Items
Set myFilterItemCopyForBody = myItems.Restrict("@SQL=""urn:schemas:httpmail:datereceived"" > '" & startTime & "' AND ""urn:schemas:httpmail:datereceived"" < '" & endTime & "'")
Set myFilterItem = myItems.Restrict("@SQL=""urn:schemas:httpmail:datereceived"" > '" & startTime & "' AND ""urn:schemas:httpmail:datereceived"" < '" & endTime & "'")

myFilterItemCopyForBody.Sort ("ReceivedTime")
myFilterItem.Sort ("ReceivedTime")
myFilterItem.SetColumns ("Subject, ReceivedTime")

For Each myItem1 In myFilterItem
    iCount = iCount + 1
    For Each myItem2 In myFilterItemCopyForBody
        jCount = jCount + 1
        If iCount = jCount Then
            'Display myItem2.Body if myItem1.Subject contains a specific string
            'MsgBox myItem2.Body
            jCount = 0
            Exit For
        End If
    Next myItem2
Next myItem1
Note1: Notice that the Body property is accessed using the 'myItem2' corresponding to 'myFilterItemCopyForBody'.
Note2: The fewer times the code has to enter the loop to access the Body property, the better! You can further improve the efficiency by playing with the Restrict filter and the logic to reduce the number of times the loop has to run.
Hope this helps, even though this is not something new!
Quite some time ago I created a PowerShell script to do what I wrote in the title: copying a selection from File-A to File-B. The way I need it (and the way I did it) is to open a "Template" file when the application starts, the user selects a range, and then they have 4 option buttons. The most used is "Copy Selection", which will copy the contents of the range selection to all of the Excel files in a specific directory. With PowerShell this works (although it took a long time to figure out) with the following code:
$strRow = $Excel.ActiveCell.Row
$strColumn = $Excel.ActiveCell.Column
$Range = $sourceWorksheet.Cells.Item($strRow, $strColumn)
$Range.Select()
$Excel.Selection.Copy() | Out-Null

foreach ($item in $files)
{
    $destinationPath = $item.FullName
    $destinationWorkBook = $Excel.WorkBooks.Open($destinationPath)
    $destinationWorkSheet = $destinationWorkBook.Worksheets.Item(1)
    $destRange = $destinationWorkSheet.Cells.Item($strRow, $strColumn)
    $destRange.Select()
    $destRange.PasteSpecial() | Out-Null
    $destinationWorkBook.Close($true)
}
That works fine, but I obviously want to get rid of PowerShell. I've created a C# application using WPF that will look and work great, I just have to get the logic working. I can't seem to actually target a range selection. I've tried a number of methods and properties, and none seem to get what I need. I can get the actual "selection" cell, but I can't get any type of range. The objects I retrieve generally have Rows and Columns properties, so if need be I guess I could extrapolate that information...but there's gotta be a better way.
Right now the closest thing I have would be to use Application.ActiveCell.Copy(); (or Select), but it seems to have issues when I try pasting it.
Any ideas? I can't seem to figure out why this isn't working.
EDIT: I've solved my own issue...sorry if anyone wasted time looking at this.
I'm an idiot...I spent a bit more time and figured out how to do it. The below code works perfectly - the main issue I was running into is that using Worksheet.Cells.Item requires square brackets (an indexer) rather than parentheses. I still have to loop through the documents to apply it to, but this should take care of the actual logic:
int intRow = myApp.ActiveCell.Row;
int intCol = myApp.ActiveCell.Column;
myApp.Selection.Copy();                                 // copy the current selection
Range destRange = mySheet.Cells.Item[intRow, intCol];   // note the square brackets on Item
destRange.PasteSpecial();
Hope I didn't waste anyone's time and they can figure this out now!
I am trying to monitor interface bandwidth on a remote Windows machine. So far I have used SNMP with the Cisco bandwidth formula, but that requires retrieving two samples at two different times. Last but not least, it seems that the value I record with SNMP is quite wrong. Since I have WMI support, I'd like to use it, but the only value I've found (which seems to be what I'm looking for) is BytesTotalPerSec of Win32_PerfRawData_Tcpip_NetworkInterface. That value, however, looks more like a total counter (just like the SNMP one). Is there a way to retrieve the instant current bandwidth through WMI? To clarify: the Current Bandwidth field always returns 1000000000 (which is the maximum bandwidth) and, as you can imagine, it is not helpful.
Performance counter data is exposed in 2 places, Win32_PerfRawData* and Win32_PerfFormattedData*. The former contains raw data, the latter contains derived statistics, and is what you're after.
What you typically see in perfmon (for example) is the Win32_PerfFormattedData* data.
Try this:
Set objWMI = GetObject("winmgmts://./root\cimv2")
Set objRefresher = CreateObject("WbemScripting.SWbemRefresher")
Set objInterfaces = objRefresher.AddEnum(objWMI, _
    "Win32_PerfFormattedData_Tcpip_NetworkInterface").ObjectSet

While (True)
    objRefresher.Refresh
    For Each RefreshItem In objRefresher
        For Each objInstance In RefreshItem.ObjectSet
            WScript.Echo objInstance.Name & ";" _
                & objInstance.BytesReceivedPersec & ";" _
                & objInstance.BytesSentPersec
        Next
    Next
    WScript.Echo
    WScript.Sleep 1000
Wend
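If you are consuming this from .NET rather than VBScript, a roughly equivalent sketch using System.Management (same WMI class and property names as the script above):
// Requires a reference to System.Management.
var searcher = new System.Management.ManagementObjectSearcher(
    @"root\cimv2",
    "SELECT Name, BytesReceivedPersec, BytesSentPersec " +
    "FROM Win32_PerfFormattedData_Tcpip_NetworkInterface");

while (true)
{
    foreach (System.Management.ManagementObject nic in searcher.Get())
    {
        Console.WriteLine("{0};{1};{2}",
            nic["Name"], nic["BytesReceivedPersec"], nic["BytesSentPersec"]);
    }
    System.Threading.Thread.Sleep(1000);   // one sample per second, like the script
}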
From experience, taking a measurement for a given second is pretty useless unless you're collecting the metric every second.
If you wanted the minutely bandwidth, you could derive it yourself from the raw data by taking 2 samples (you have to do this on Windows 2000 anyway)
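The arithmetic is simple once you have two raw samples (a sketch; Timestamp_Sys100NS is in 100-nanosecond ticks, and BytesTotalPersec in the raw class is a cumulative byte count despite its name):
// n0/n1 are BytesTotalPersec and t0/t1 are Timestamp_Sys100NS from two reads of
// Win32_PerfRawData_Tcpip_NetworkInterface taken a known interval apart.
ulong n0 = 0, t0 = 0, n1 = 0, t1 = 0;    // fill these from the two WMI reads
double seconds = (t1 - t0) / 1e7;        // 100 ns ticks -> seconds
double bytesPerSec = seconds > 0 ? (n1 - n0) / seconds : 0;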
See the Windows 2000 section here if that makes more sense:
Derived stats on Windows 2000
There's an excellent article here, Make your own Formatted Performance Data Provider, if you want to delve into collecting more statistical information over a longer sampling interval.
John
I am currently running nginx on my Windows system and am making a little control panel to show statistics for my web server.
I'm trying to get the performance counters for the CPU usage and memory usage of the process, but nginx shows up as more than one process; it can vary from 2 to 5 depending on the setting in the configuration file. My setting shows two processes, i.e. nginx.exe and nginx.exe.
I know what performance counters to use, % Processor Time and Working Set - Private, but how would I be able to get the individual values of both processes so I can add them together for a final value?
I tried using the code found at Waffles' question, but it could only output the values for the first process of the two.
Thanks.
EDIT - Working Code
for (int i = 0; i < instances.Length; i++)
{
    if (i == 0)
    {
        // The first instance of a process has no suffix, e.g. "nginx"
        toPopulate = new PerformanceCounter
            ("Process", "Working Set - Private",
             toImport[i].ProcessName,
             true);
    }
    else
    {
        // Later instances are suffixed, e.g. "nginx#1", "nginx#2"
        toPopulate = new PerformanceCounter
            ("Process", "Working Set - Private",
             toImport[i].ProcessName + "#" + i,
             true);
    }
    totalNginRam += toPopulate.NextValue();
    instances[i] = toPopulate;
}
Look at the accepted answer to that question. Try running perfmon. Processes that have the same names will be identified as something like this process#1, process#2, etc. In your case it could be nginx#1, nginx#2, etc.
Edit:
You need to pass the instance name to either the appropriate constructor overload or the InstanceName property. According to this, it looks like the proper format is to use underscore. So, process_1, process_2.
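If you'd rather not hard-code the #1, #2 suffixes, you can also enumerate the live instance names and sum over the ones that match; a sketch (category and counter names as in the question):
// Pick up whatever "nginx", "nginx#1", "nginx#2", ... instances exist right now.
var category = new PerformanceCounterCategory("Process");
float totalNginxRam = 0;

foreach (string instance in category.GetInstanceNames())
{
    if (!instance.StartsWith("nginx", StringComparison.OrdinalIgnoreCase))
        continue;

    using (var counter = new PerformanceCounter("Process", "Working Set - Private", instance, true))
    {
        totalNginxRam += counter.NextValue();
    }
}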
When using Azure Log Analytics, you can specify a path such as
Process(nginx*)\% Processor Time
This seems to be collecting data from all processes that match the wildcard pattern at any time. I can confirm that it picks up data from new processes (started after changing the settings) and it does not pick up data from "dead" processes. However, the InstanceName (such as nginx#3) may be reused, making it hard to tell when a process was "replaced" by a new one.
I have not been able to do this in Performance Monitor. The closest thing is to type "nginx*" in the search box of the "Add Counters" dialog, then select <All searched instances>. This will create one counter per process, and counters will not be dynamically added or removed as processes are started or stopped.
Perhaps it can be done with data collector sets created via PowerShell. However, even if you are able to set a path with a wildcard in the instance part, it is not guaranteed that it will behave as you expect (i.e., automatically collect data from all processes that are running at any time).
I'm trying to send faxes through RightFax in an efficient manner.
My users need to fax PDFs and even though the application is working fine, it is very slow for bulk sending (> 20 recipients, taking about 40 seconds per fax).
// Fax created
fax.Attachments.Add(@"C:\Test Attachments\Products.pdf", BoolType.False);
fax.Send();
RightFax has this concept of Library Documents, so what I thought we could do was store a PDF document as a Library Document on the server and then reuse it, so there is no need to upload this PDF for n users.
I can create Library Documents without problems (I can retrieve them, etc.), but how do I add a PDF to this? (I have rights on the server.)
LibraryDocument doc2 = server.LibraryDocuments.Create;
doc2.Description = "Test Doc 1";
doc2.ID = "568"; // tried ints everything!
doc2.IsPublishedForWeb = BoolType.True;
doc2.PageCount = 2;
doc2.Save();
Also, once I created a fax, the API gives you an option to "StoreAsNewLibraryDocument", which throws an exception when run: System.ArgumentException: Value does not fall within the expected range.
fax.StoreAsNewLibraryDocument("PRODUCTS","the products");
What matters for us is how to send, say, 500 faxes in the most efficient way possible using the API through RFCOMAPILib. I think that if we can reuse the attached PDF, it would greatly improve performance. Clearly, 40 seconds per fax is unacceptable when you have hundreds of recipients.
How do we send faxes with attachments in the most efficient mode through the API?
StoreAsNewLibraryDocument() is the only practical way to store LibraryDocuments using the RightFax COM API, but assuming you're not using a pre-existing LibraryDocument, you have to call the function immediately after sending the first fax, which will have a regular file (not LibraryDoc) attachment.
(Don't create a LibraryDoc object on the server yourself, as you do above - you'd only do that if you have an existing file on the server that isn't a LibraryDocument, and you want to make it into one. You'll probably never encounter such a scenario.)
The new LibraryDocument is then referenced (in subsequent fax attachments) by the ID string you specify as the first argument of StoreAsNewLibraryDocument(). If that ID isn't unique to the RightFax Server's LibraryDocuments collection, you'll get an error. (You could use StoreAsLibraryDocumentUpdate() instead, if you want to actually replace the file on the server.) Also, remember to always specify the AttachmentType.
In theory, this should be all you really have to do:
// First fax:
fax.Attachments.Add(@"C:\Test Attachments\Products.pdf", BoolType.False);
fax.Attachments.Item(1).AttachmentType = AttachmentType.aFile;
fax.Send();
fax.StoreAsNewLibraryDocument("PRODUCTS", "The Products");
server.LibraryDocuments("PRODUCTS").IsPublishedForWeb = BoolType.True;

// And for all subsequent faxes:
fax.Attachments.Add(server.LibraryDocuments("PRODUCTS"));
fax.Attachments.Item(1).AttachmentType = AttachmentType.aLibraryDocument;
fax.Send();
The reason I say "in theory" is because this doesn't always work. Sometimes when you call StoreAsNewLibraryDocument() you end up with a LibraryDoc with a PageCount of zero. This happens seemingly at random, and is probably due to a bug in RightFax, or possibly a server misconfiguration. So it's a very good idea to check for...
server.LibraryDocuments("PRODUCTS").PageCount == 0
...before you send any of the subsequent faxes, and if necessary retry until it works, or (if it won't) store the LibraryDoc some other way and give up on StoreAsNewLibraryDocument().
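A sketch of that guard (the retry count is arbitrary, and the StoreAsLibraryDocumentUpdate argument order is assumed to mirror StoreAsNewLibraryDocument):
// After fax.StoreAsNewLibraryDocument("PRODUCTS", "The Products"):
int attempts = 0;
while (server.LibraryDocuments("PRODUCTS").PageCount == 0 && attempts < 3)
{
    // The stored copy came back empty; push the attachment up again.
    fax.StoreAsLibraryDocumentUpdate("PRODUCTS", "The Products");   // argument order assumed
    attempts++;
}
if (server.LibraryDocuments("PRODUCTS").PageCount == 0)
{
    // Still empty: give up on the library route and attach the local file for each fax instead.
}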
If you don't have that problem, you can usually send a mass fax in about 1/10th of the time it takes when you attach (and upload) the local file each time.
If someone from OpenText/RightFax reads this and can explain why StoreAsNewLibraryDocument() sometimes results in zero-page faxes, an additional answer about that would be appreciated quite a bit!
I have a list of ~20,000 email addresses, some of which I know to be fraudulent attempts to get around a "1 per e-mail" limit, such as username1@gmail.com, username1a@gmail.com, username1b@gmail.com, etc. I want to find similar email addresses for evaluation. Currently I'm using a Levenshtein algorithm to check each e-mail against the others in the list and report any with an edit distance of less than 2. However, this is painstakingly slow. Is there a more efficient approach?
The test code I'm using now is:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Threading;

namespace LevenshteinAnalyzer
{
    class Program
    {
        const string INPUT_FILE = @"C:\Input.txt";
        const string OUTPUT_FILE = @"C:\Output.txt";

        static void Main(string[] args)
        {
            var inputWords = File.ReadAllLines(INPUT_FILE);
            var outputWords = new SortedSet<string>();

            for (var i = 0; i < inputWords.Length; i++)
            {
                if (i % 100 == 0)
                    Console.WriteLine("Processing record #" + i);

                var word1 = inputWords[i].ToLower();
                for (var n = i + 1; n < inputWords.Length; n++)
                {
                    if (i == n) continue;
                    var word2 = inputWords[n].ToLower();
                    if (word1 == word2) continue;
                    if (outputWords.Contains(word1)) continue;
                    if (outputWords.Contains(word2)) continue;

                    var distance = LevenshteinAlgorithm.Compute(word1, word2);
                    if (distance <= 2)
                    {
                        outputWords.Add(word1);
                        outputWords.Add(word2);
                    }
                }
            }

            File.WriteAllLines(OUTPUT_FILE, outputWords.ToArray());
            Console.WriteLine("Found {0} words", outputWords.Count);
        }
    }
}
Edit: Some of the stuff I'm trying to catch looks like:
01234567890@gmail.com
0123456789@gmail.com
012345678@gmail.com
01234567@gmail.com
0123456@gmail.com
012345@gmail.com
01234@gmail.com
0123@gmail.com
012@gmail.com
You could start by applying some prioritization to which emails to compare to one another.
A key reason for the performance limitation is the O(n²) cost of comparing each address to every other address. Prioritization is the key to improving the performance of this kind of search algorithm.
For instance, you could bucket all emails that have a similar length (+/- some amount) and compare that subset first. You could also strip all special characters (numbers, symbols) from the emails and find those that are identical after that reduction.
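For example, the strip-and-compare idea might look roughly like this (the reduction rule is illustrative; the file path matches the question's test code):
// Reduce each address to its letters only, so "username1a@gmail.com" and
// "username1b@gmail.com" end up with the same key, then flag keys used more than once.
Func<string, string> normalize = email =>
    new string(email.ToLowerInvariant().Where(char.IsLetter).ToArray());

var suspicious = File.ReadAllLines(@"C:\Input.txt")
    .GroupBy(normalize)
    .Where(g => g.Count() > 1);

foreach (var group in suspicious)
    Console.WriteLine(string.Join(", ", group.ToArray()));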
You may also want to create a trie from the data rather than processing it line by line, and use that to find all emails that share a common set of suffixes/prefixes and drive your comparison logic from that reduction. From the examples you provided, it looks like you are looking for addresses where a part of one address could appear as a substring within another. Tries (and suffix trees) are an efficient data structure for performing these types of searches.
Another possible way to optimize this algorithm would be to use the date when the email account is created (assuming you know it). If duplicate emails are created they would likely be created within a short period of time of one another - this may help you reduce the number of comparisons to perform when looking for duplicates.
Well you can make some optimizations, assuming that the Levenshtein difference is your bottleneck.
1) With a Levenshtein distance of 2, the emails are going to be within 2 characters' length of one another, so don't bother to do the distance calculation unless abs(length(email1) - length(email2)) <= 2.
2) Again, with a distance of 2, there are not going to be more than 2 characters different, so you can make HashSets of the characters in the emails, and take the length of the union minus the length of the intersection of the two. (I believe this is a SymmetricExceptWith) If the result is > 2, skip to the next comparison.
OR
Code your own Levenshtein distance algorithm. If you are only interested in distances < k, you can optimize the run time. See "Possible Improvements" on the Wikipedia page: http://en.wikipedia.org/wiki/Levenshtein_distance.
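Putting points 1 and 2 in front of the expensive call might look like this (LevenshteinAlgorithm.Compute is the class from the question; note the character-set check is a heuristic and can, in rare cases, reject a genuine distance-2 pair):
static bool ProbablyWithinTwoEdits(string a, string b)
{
    // 1) Lengths differing by more than 2 cannot give an edit distance of 2 or less.
    if (Math.Abs(a.Length - b.Length) > 2)
        return false;

    // 2) Symmetric difference of the two character sets (the heuristic from point 2 above).
    var diff = new HashSet<char>(a);
    diff.SymmetricExceptWith(b);
    if (diff.Count > 2)
        return false;

    // Only now pay for the full dynamic-programming comparison.
    return LevenshteinAlgorithm.Compute(a, b) <= 2;
}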
You could add a few optimizations:
1) Keep a list of known frauds and compare to that first. After you get going in your algorithm, you might be able hit against this list faster than you hit the main list.
2) Sort the list first. It won't take too long (in comparison) and will increase the chance of matching the front of the string first. Have it sort by domain name first, then by username. Perhaps put each domain in its own bucket, then sort and also compare against that domain.
3) Consider stripping the domain in general. spammer3@gmail.com and spammer3@hotmail.com will never trigger your flag.
If you can define a suitable mapping to some k-dimensional space, and a suitable norm on that space, this reduces to the All Nearest Neighbours Problem which can be solved in O(n log n) time.
Finding such a mapping, however, might be difficult. Maybe someone will take this partial answer and run with it.
Just for completeness, you should consider the semantics of email addresses as well, in terms of:
Gmail treats user.name and username as being the same, so both are valid email addresses belonging to the same user. Other services may do this as well. LBushkin's suggestion to strip special characters would help here.
Sub-addressing can potentially trip your filter if users wise up to it. You'd want to drop the sub-address data before comparison.
You might want to look at the full data set to see if there is other commonality between accounts that have spoofed emails.
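A sketch of stripping those artifacts before comparison (Gmail-style rules only; other providers behave differently, so this is illustrative):
static string CanonicalizeGmail(string email)
{
    int at = email.IndexOf('@');
    if (at < 0) return email.ToLowerInvariant();

    string local = email.Substring(0, at).ToLowerInvariant();
    string domain = email.Substring(at + 1).ToLowerInvariant();

    // Drop a "+tag" sub-address and ignore dots, both of which Gmail treats as insignificant.
    int plus = local.IndexOf('+');
    if (plus >= 0) local = local.Substring(0, plus);
    local = local.Replace(".", "");

    return local + "@" + domain;
}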
I don't know what your application does, but if there are other key points, then use those to filter down which addresses you are going to compare.
Sort everything into a hashtable first. The key should be the domain name of the email, e.g. "gmail.com". Strip out special characters from the values, as was mentioned above.
Then check all the gmail.com's against one another. That should be much faster. Do not compare things that are more than 3 characters different in length.
As a second step, check all the keys against one another, and develop groupings there. (gmail.com == googlemail.com, for example.)
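Roughly, that two-level grouping could look like this (the alias table is just an example):
// Bucket addresses by (canonical) domain; only usernames within the same bucket get compared.
var domainAliases = new Dictionary<string, string> { { "googlemail.com", "gmail.com" } };
var buckets = new Dictionary<string, List<string>>();

foreach (string address in File.ReadAllLines(@"C:\Input.txt"))
{
    int at = address.IndexOf('@');
    if (at < 0) continue;

    string domain = address.Substring(at + 1).ToLowerInvariant();
    string key;
    if (!domainAliases.TryGetValue(domain, out key)) key = domain;

    List<string> bucket;
    if (!buckets.TryGetValue(key, out bucket))
        buckets[key] = bucket = new List<string>();
    bucket.Add(address.Substring(0, at).ToLowerInvariant());
}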
I agree with others' comments about comparing email addresses not being too helpful, since users could just as well create fraudulent, dissimilar-looking addresses.
I think it's better to come up with other solutions, such as limiting the number of emails you can submit per hour/day, or the time between those addresses being received by you and the invites being sent to the users. Basically, work it out in a way where it is comfortable to send a few invites per day, but a PITA to send out many. I guess most users would forget or give up if they had to keep at it over a relatively long period of time in order to get their freebies.
Is there any way you can check the IP address of the person creating the email? That would be a simple way to determine, or at least give you more information about, whether the different email addresses have come from the same person.