I need a pattern for processing incoming emails.
My current pseudo-code is like this:
if sender is a@a.com and messageBody contains "aaa" then
    extract the content according to the aaa function
    save it to the database
    move the message to the archive
else if messageBody contains "bbb" then
    extract the content according to the bbb function
    save it to the database
    inform the sender
    move the message to the archive
else if messageBody NOT contains "ccc" and sender is "sender@ccc.com" then
    leave the message in the inbox so that it will be manually processed
else if ...
...
So I ended up with one huge function that is thousands of lines long.
How can I make this thing simpler?
Thanks in advance
Solving this problem well requires a good architecture, and machine learning is one of the better approaches for this kind of problem. Still, there are some things you can take care of in order to make it simpler:
Rather than putting 10 ifs for 10 email addresses, keep a list of unwanted senders (whose mail goes to spam)
Keep a list of unwanted subjects
Group emails by time interval and process them accordingly: morning emails, noon emails, evening emails, etc.
Add a "has attachment" check
Add a "no subject" check
Add a "no body" check
Keep a list of friends' email addresses
Add a "sender is from the same domain" check
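For illustration, here is a minimal sketch of the list-of-rules idea above. The EmailMessage type and the helper methods (ExtractAaa, SaveToDatabase, Archive, etc.) are hypothetical placeholders, not an exact design:

// Hypothetical sketch: each rule pairs a predicate with an action, so adding a new
// case means adding a list entry instead of another else-if branch.
public class EmailRule
{
    public Func<EmailMessage, bool> Matches { get; set; }
    public Action<EmailMessage> Process { get; set; }
}

var rules = new List<EmailRule>
{
    new EmailRule
    {
        Matches = m => m.Sender == "a@a.com" && m.Body.Contains("aaa"),
        Process = m => { ExtractAaa(m); SaveToDatabase(m); Archive(m); }
    },
    new EmailRule
    {
        Matches = m => m.Body.Contains("bbb"),
        Process = m => { ExtractBbb(m); SaveToDatabase(m); InformSender(m); Archive(m); }
    },
    // ...one entry per case...
};

// Anything unmatched stays in the inbox for manual processing.
var rule = rules.FirstOrDefault(r => r.Matches(message));
if (rule != null)
    rule.Process(message);
else
    LeaveInInbox(message);

The lists of unwanted senders, unwanted subjects, friends, and so on then become data that the predicates consult, rather than more branches of code.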
Thanks. :)
I am working on a project where the users have to put in the physical address of their organization; in many cases users will put in a PO Box rather than their physical address. I need a way in C# to determine whether a user put in a P.O. Box or PO Box (or any other variation of this) rather than a "29 Maple Street" style address. I have had a few thoughts, but I thought I would get some really great feedback here.
Thanks
I would try to parse the address as a string, splitting it into tokens, and then look for 'P.O. Box' or 'PO Box' in the resulting array: if it is found, the box number should be in the next element(s).
You will also need a way to detect the city so you know when to stop. You could use the GeoNames database (http://www.geonames.org/) as a data source.
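As a rough illustration of the string/regex approach (the pattern is only a sketch and will not catch every variation people type):

using System.Text.RegularExpressions;

// Matches common variants such as "PO Box 123", "P.O. Box 123", "P O Box 123", "POB 123".
// Real-world input will need a broader pattern or an address-validation service.
static bool LooksLikePoBox(string address)
{
    return Regex.IsMatch(address, @"\bP\.?\s*O\.?\s*B(ox)?\.?\s*\d+", RegexOptions.IgnoreCase);
}

If LooksLikePoBox returns true, you can then tokenize the address and treat the element(s) after the matched "PO Box" text as the box number.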
I'm really interested in the Numl.net library to scan incoming email and extract bits of data. As an example, let's imagine I want to extract a customer reference number from an email, which could be in the subject line or body content.
void Main()
{
    // get the descriptor that describes the features and label from the training objects
    var descriptor = Descriptor.Create<Email>();

    // create a decision tree generator and teach it about the Email descriptor
    var decisionTreeGenerator = new DecisionTreeGenerator(descriptor);

    // load the training data
    var repo = new EmailTrainingRepository(); // inject this
    var trainingData = repo.LoadTrainingData(); // returns List<Email>

    // create a model based on our training data using the decision tree generator
    var decisionTreeModel = decisionTreeGenerator.Generate(trainingData);

    // create an email that should find C4567890
    var example1 = new Email
    {
        Subject = "Regarding my order C4567890",
        Body = "I am very unhappy with your level of service. My order has still not arrived."
    };

    // create an email that should find C89779237
    var example2 = new Email
    {
        Subject = "I want to return my goods",
        Body = "My customer number is C89779237 and I want to return my order."
    };

    // create an email that should find C3239544-1
    var example3 = new Email
    {
        Subject = "Customer needs an electronic invoice",
        Body = "Please reissue the invoice as a PDF for customer C3239544-1."
    };

    var email1 = decisionTreeModel.Predict<Email>(example1);
    var email2 = decisionTreeModel.Predict<Email>(example2);
    var email3 = decisionTreeModel.Predict<Email>(example3);

    Console.WriteLine("The example1 was predicted as {0}", email1.CustomerNumber);
    if (ReadBool("Was this answer correct? Y/N"))
    {
        repo.Add(email1);
    }

    Console.WriteLine("The example2 was predicted as {0}", email2.CustomerNumber);
    if (ReadBool("Was this answer correct? Y/N"))
    {
        repo.Add(email2);
    }

    Console.WriteLine("The example3 was predicted as {0}", email3.CustomerNumber);
    if (ReadBool("Was this answer correct? Y/N"))
    {
        repo.Add(email3);
    }
}

// Define other methods and classes here
public class Email
{
    // Subject
    [Feature]
    public string Subject { get; set; }

    // Body
    [Feature]
    public string Body { get; set; }

    [Label]
    public string CustomerNumber { get; set; } // This is the label or value that we wish to predict based on the supplied features
}

static bool ReadBool(string question)
{
    while (true)
    {
        Console.WriteLine(question);
        String r = (Console.ReadLine() ?? "").ToLower();
        if (r == "y")
            return true;
        if (r == "n")
            return false;
        Console.WriteLine("!!Please Select a Valid Option!!");
    }
}
There are a few things I haven't quite grasped though.
In a supervised network, do I need to re-build the decision tree every time I run the application, or can I store it off somehow and then reload it as and when required? I'm trying to avoid the processing time of rebuilding that decision tree on every run.
Also, can the network continually add to its own training data as the data gets validated by a human? I.e. we have an initial training set, the network decides on an outcome, and if a human says 'well done' the new example gets added to the training set in order to improve it (and vice versa when the network gets it wrong). I assume I can just add to the training set once a human has validated that a prediction is correct? Does my repo.Add(email) seem like a logical way to do this?
If I do add to the training data, at what point does the training data become "more than required"?
I don't think this is a good problem to solve using machine learning (although I am interested in your findings). My concerns would be that customer numbers change over time requiring you to regenerate the model each time. Binary classification algorithms such as Naïve Bayes, Decision Trees, Logistic Regression and SVMs require you to know ahead of time each class (i.e. Customer Ref No).
You could try using feature engineering and predicting whether a given word is or is not a customer reference number (i.e. 1 or 0). To do this you simply engineer features like the below:
IsWordStartsWithC (bool)
WordLength
Count of Digits / Word Length
Count of Letters / Word Length
Then use a Decision Tree or Logistic Regression classifier to predict whether the word is a CRN or not. To extract the CRN from the email, simply iterate over each word in the email and, if Model.Predict(word) outputs a 1, you have hopefully captured the CRN for that email.
This method should not need to be retrained.
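As a rough sketch of that feature engineering, assuming numl can treat boolean/numeric properties as features the same way it treats the string features above (the class and property names here are made up for illustration):

// Hypothetical per-word training example: the features describe the word,
// and the label says whether the word is a customer reference number.
public class WordExample
{
    [Feature]
    public bool StartsWithC { get; set; }

    [Feature]
    public int WordLength { get; set; }

    [Feature]
    public double DigitRatio { get; set; }   // count of digits / word length

    [Feature]
    public double LetterRatio { get; set; }  // count of letters / word length

    [Label]
    public bool IsCustomerReference { get; set; }
}

// Build one example per (non-empty) word in the email text.
static WordExample ToExample(string word)
{
    return new WordExample
    {
        StartsWithC = word.StartsWith("C", StringComparison.OrdinalIgnoreCase),
        WordLength = word.Length,
        DigitRatio = word.Count(char.IsDigit) / (double)word.Length,
        LetterRatio = word.Count(char.IsLetter) / (double)word.Length
    };
}

To extract the CRN from an email you would split the subject and body into words, convert each to a WordExample, and keep the words the trained model predicts as IsCustomerReference.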
In a supervised network, do I need to re-build the decision tree every time I run the application, or can I store it off somehow and then reload it as and when required? I'm trying to avoid the processing time of rebuilding that decision tree on every run.
You can store the generated model using any stream object via the Model.Save() method. All supervised models in numl currently implement this base class. Apart from the Neural Network model they should save fine.
Also, can the network continually add to its own training data as the data gets validated by a human? I.e. we have an initial training set, the network decides on an outcome, and if a human says 'well done' the new example gets added to the training set in order to improve it (and vice versa when the network gets it wrong). I assume I can just add to the training set once a human has validated that a prediction is correct? Does my repo.Add(email) seem like a logical way to do this?
This is a good reinforcement learning example. At present numl doesn't implement this but hopefully it will in the near future :)
If I do add to the training data, at what point does the training data become "more than required"?
The best way to check this is through validation of the training and test set accuracy measures. You can keep adding more training data while the accuracy of the test set goes up. If you find that the accuracy goes down on the test set and continues to go up on the training set, it is now overfitting and it's safe to stop adding more data.
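Using the code from the question, a rough way to measure this could be a simple hold-out split (allEmails and the 80/20 split are illustrative):

// Hold back part of the labelled emails as a test set the model never trains on.
var shuffled = allEmails.OrderBy(e => Guid.NewGuid()).ToList();
var trainingData = shuffled.Take((int)(shuffled.Count * 0.8)).ToList();
var testData = shuffled.Skip((int)(shuffled.Count * 0.8)).ToList();

var model = decisionTreeGenerator.Generate(trainingData);

// Test accuracy = fraction of held-out emails whose predicted CustomerNumber matches the known label.
int correct = testData.Count(e =>
    model.Predict<Email>(new Email { Subject = e.Subject, Body = e.Body }).CustomerNumber == e.CustomerNumber);
double testAccuracy = correct / (double)testData.Count;

Keep adding training data while testAccuracy improves; once it starts falling while training accuracy keeps rising, you are overfitting and can stop.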
It's a little late, but I'm also learning the numl library, and I think I can shed some light on some of your questions.
In a supervised network, do I need to re-build the decision tree every time I run the application, or can I store it off somehow and then reload it as and when required? I'm trying to avoid the processing time of rebuilding that decision tree on every run.
There is currently an IModel.Save method that is supposed to be implemented in each class. However, as best I can tell it isn't yet implemented. There are, however, serialization tests that work for most models, including the DecisionTree, as shown in the DecisionTreeSerializationTests:
Serialize(model);
Which simply calls:
internal void Serialize(object o)
{
    var caller = new StackFrame(1, true).GetMethod().Name;
    string file = string.Format(_basePath, caller);
    if (File.Exists(file)) File.Delete(file);
    JsonHelpers.Save(file, o);
}
They have a bunch of custom-written converters for JSON serialization, and I think these can be used until Model.Save is implemented. You basically just use numl.Utils.JsonHelpers to serialize/deserialize the model to/from JSON (which you can persist however you want). Also, I think this is one of the things they are currently working on.
Also, can the network continually add to its own training data as the data gets validated by a human? I.e. we have an initial training set, the network decides on an outcome, and if a human says 'well done' the new example gets added to the training set in order to improve it (and vice versa when the network gets it wrong). I assume I can just add to the training set once a human has validated that a prediction is correct? Does my repo.Add(email) seem like a logical way to do this?
You can always add data points to train your model at any point in time. However, I think you would have to retrain the model from scratch. There is Online Machine Learning that trains as data points come in individually, but I don't think numl currently implements this. So, to do this you would probably run a daily/weekly job depending on your requirements to retrain the model with the expanded training data.
If I do add to the training data, at what point does the training data become "more than required"?
The general rule is "more data means better prediction." You can always look at your gains and decide for yourself whether you are still getting benefit out of increasing your training data sample size. That said, this is not a hard and fast rule (go figure). If you just google "machine learning more data better accuracy" you will find a ton of information on the subject, all of which I cull down to "more data means better prediction" and "see what works best for you". In your particular example of training against email text, it is my understanding that more data will only help you.
All of that being said, I'd also say a couple other things:
It sounds like if you're just trying to get customer/order numbers from emails, a good regex would serve you better than analyzing with ML. At the very least, I would make regexes part of the feature set of your training data, so that perhaps you could train it to learn typos or input variations of your structure.
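For example, a pattern based on the sample reference numbers in the question (C4567890, C89779237, C3239544-1) might look like this; it is only a sketch, so adjust it to whatever your real format is:

using System.Text.RegularExpressions;

// Illustrative only: a "C" followed by several digits, with an optional "-digit" suffix.
static string TryExtractCustomerNumber(string text)
{
    var match = Regex.Match(text, @"\bC\d{6,}(-\d+)?\b", RegexOptions.IgnoreCase);
    return match.Success ? match.Value : null;
}

// Usage: check the subject first, then fall back to the body.
// var crn = TryExtractCustomerNumber(email.Subject) ?? TryExtractCustomerNumber(email.Body);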
I'm not an expert on ML, nor on numl; I just happen to be learning numl as well. They have so far been very responsive to me on gitter, which you (and anyone else interested in this pretty awesome open-source, mature, MIT-licensed library) should definitely check out.
I'm looking for some suggestions on better approaches to handling a scenario with reading a file in C#; the specific scenario is something that most people wouldn't be familiar with unless you are involved in health care, so I'm going to give a quick explanation first.
I work for a health plan, and we receive claims from doctors in several ways (EDI, paper, etc.). The paper form for standard medical claims is the "HCFA" or "CMS 1500" form. Some of our contracted doctors use software that allows their claims to be generated and saved in a HCFA "layout", but in a text file (so, you could think of it like being the paper form, but without the background/boxes/etc). I've attached an image of a dummy claim file that shows what this would look like.
The claim information is currently extracted from the text files and converted to XML. The whole process works ok, but I'd like to make it better and easier to maintain. There is one major challenge that applies to the scenario: each doctor's office may submit these text files to us in slightly different layouts. Meaning, Doctor A might have the patient's name on line 10, starting at character 3, while Doctor B might send a file where the name starts on line 11 at character 4, and so on. Yes, what we should be doing is enforcing a standard layout that must be adhered to by any doctors that wish to submit in this manner. However, management said that we (the developers) had to handle the different possibilities ourselves and that we may not ask them to do anything special, as they want to maintain good relationships.
Currently, there is a "mapping table" set up with one row for each different doctor's office. The table has columns for each field (e.g. patient name, Member ID number, date of birth etc). Each of these gets a value based on the first file that we received from the doctor (we manually set up the map). So, the column PATIENT_NAME might be defined in the mapping table as "10,3,25" meaning that the name starts on line 10, at character 3, and can be up to 25 characters long. This has been a painful process, both in terms of (a) creating the map for each doctor - it is tedious, and (b) maintainability, as they sometimes suddenly change their layout and then we have to remap the whole thing for that doctor.
The file is read in line by line, and each line is added to a List<string>.
Once this is done, we do the following, where we get the map data and read through the list of file lines and get the field values (recall that each mapped field is a value like "10,3,25" (without the quotes)):
ClaimMap M = ClaimMap.GetMapForDoctor(17);
List<HCFA_Claim> ClaimSet = new List<HCFA_Claim>();

// Claims is List<List<string>>, where we have a List<string> for each claim in the
// text file (it can have more than one, and the file is split up into separate
// claims earlier in the process)
foreach (List<string> cl in Claims)
{
    HCFA_Claim c = new HCFA_Claim();
    c.Patient = new Patient();
    c.Patient.FullName = cl[Int32.Parse(M.Name.Split(',')[0]) - 1]
                           .Substring(Int32.Parse(M.Name.Split(',')[1]) - 1,
                                      Int32.Parse(M.Name.Split(',')[2]))
                           .Trim();
    //...and so on...
    ClaimSet.Add(c);
}
Sorry this is so long...but I felt that some background/explanation was necessary. Are there any better/more creative ways of doing something like this?
Given the lack of standardization, I think your current solution, although not ideal, may be the best you can do. Given this situation, I would at least isolate concerns (file read, file parsing, file conversion to standard XML, mapping-table access, etc.) into simple components, employing obvious patterns (DI, strategies, factories, repositories, etc.) where needed to decouple the system from its underlying dependency on the mapping table and the current parsing algorithms.
You need to work on the DRY (Don't Repeat Yourself) principle by separating concerns.
For example, the code you posted appears to have an explicit knowledge of:
how to parse the claim map, and
how to use the claim map to parse a list of claims.
So there are at least two responsibilities directly relegated to this one method. I'd recommend changing your ClaimMap class to be more representative of what it's actually supposed to represent:
public class ClaimMap
{
    public ClaimMapField Name { get; set; }
    ...
}

public class ClaimMapField
{
    public int StartingLine { get; set; }
    // I would have the parser subtract one when creating this, to make it 0-based.
    public int StartingCharacter { get; set; }
    public int MaxLength { get; set; }
}
Note that the ClaimMapField represents in code what you spent considerable time explaining in English. This reduces the need for lengthy documentation. Now all the M.Name.Split calls can actually be consolidated into a single method that knows how to create ClaimMapFields out of the original text file. If you ever need to change the way your ClaimMaps are represented in the text file, you only have to change one point in code.
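That consolidation could look something like the sketch below (the method name and the lack of error handling are just for illustration); note that it does the subtract-one conversion in exactly one place:

// Parses a mapping entry such as "10,3,25" into a ClaimMapField, converting the
// 1-based line/character positions to 0-based indexes so nothing else has to.
public static ClaimMapField ParseField(string mapEntry)
{
    var parts = mapEntry.Split(',');
    return new ClaimMapField
    {
        StartingLine = int.Parse(parts[0]) - 1,
        StartingCharacter = int.Parse(parts[1]) - 1,
        MaxLength = int.Parse(parts[2])
    };
}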
Now your code could look more like this:
c.Patient.FullName = cl[map.Name.StartingLine].Substring(map.Name.StartingCharacter, map.Name.MaxLength).Trim();
c.Patient.Address = cl[map.Address.StartingLine].Substring(map.Address.StartingCharacter, map.Address.MaxLength).Trim();
...
But wait, there's more! Any time you see repetition in your code, that's a code smell. Why not extract out a method here:
public string ParseMapField(ClaimMapField field, List<string> claim)
{
    return claim[field.StartingLine].Substring(field.StartingCharacter, field.MaxLength).Trim();
}
Now your code can look more like this:
HCFA_Claim c = new HCFA_Claim
{
    Patient = new Patient
    {
        FullName = ParseMapField(map.Name, cl),
        Address = ParseMapField(map.Address, cl),
    }
};
By breaking the code up into smaller logical pieces, you can see how each piece becomes very easy to understand and validate visually. You greatly reduce the risk of copy/paste errors, and when there is a bug or a new requirement, you typically only have to change one place in code instead of every line.
If you are only getting unstructured text, you have to parse it. If the text content changes you have to fix your parser. There's no way around this. You could probably find a 3rd party application to do some kind of visual parsing where you highlight the string of text you want and it does all the substring'ing for you but still unstructured text == parsing == fragile. A visual parser would at least make it easier to see mistakes/changed layouts and fix them.
As for parsing it yourself, I'm not sure about the line-by-line approach. What if something you're looking for spans multiple lines? You could bring the whole thing in a single string and use IndexOf to substring that with different indices for each piece of data you're looking for.
You could always use RegEx instead of Substring if you know how to do that.
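A small sketch of those two ideas, reading the claim as one string and leaning on patterns for values with a recognizable shape (the anchor text and patterns are purely illustrative):

using System;
using System.Text.RegularExpressions;

// Pull the whole claim into a single string so a field that spans lines stays contiguous.
string claimText = string.Join(Environment.NewLine, cl); // cl is the List<string> for one claim

// IndexOf-style: locate a (hypothetical) anchor and read relative to it.
int pos = claimText.IndexOf("PATIENT", StringComparison.OrdinalIgnoreCase);

// Regex-style: pick out values by shape rather than by position, e.g. dates or dollar amounts.
var dates = Regex.Matches(claimText, @"\b\d{2}/\d{2}/\d{4}\b");
var charges = Regex.Matches(claimText, @"\b\d+\.\d{2}\b");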
While the basic approach you're taking seems appropriate for your situation, there are definitely ways you could clean up the code to make it easier to read and maintain. By separating out the functionality that you're currently doing all within your main loop, you could change this:
c.Patient.FullName = cl[Int32.Parse(M.Name.Split(',')[0]) - 1].Substring(Int32.Parse(M.Name.Split(',')[1]) - 1, Int32.Parse(M.Name.Split(',')[2])).Trim();
to something like this:
var parser = new FormParser(cl, M);
c.Patient.FullName = parser.GetName();
c.Patient.Address = parser.GetAddress();
// etc.
So, in your new class, FormParser, you pass the List that represents your form and the claim map for the provider into the constructor. You then have a getter for each property on the form. Inside that getter, you perform your parsing/substring logic just as you do now. Like I said, you're not really changing the method by which you're doing it, but it certainly would be easier to read and maintain, and it might reduce your overall stress level.
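A sketch of that FormParser, assuming the "line,character,length" map format from the question (names and error handling are illustrative):

public class FormParser
{
    private readonly List<string> _lines;
    private readonly ClaimMap _map;

    public FormParser(List<string> claimLines, ClaimMap map)
    {
        _lines = claimLines;
        _map = map;
    }

    public string GetName() { return GetField(_map.Name); }
    public string GetAddress() { return GetField(_map.Address); }
    // ...one getter per mapped field...

    // Each map entry is "line,character,maxLength" with 1-based positions.
    private string GetField(string mapEntry)
    {
        var parts = mapEntry.Split(',');
        int line = int.Parse(parts[0]) - 1;
        int start = int.Parse(parts[1]) - 1;
        int length = int.Parse(parts[2]);
        return _lines[line].Substring(start, length).Trim();
    }
}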
I'm trying to send faxes through RightFax in an efficient manner.
My users need to fax PDFs, and even though the application is working fine, it is very slow for bulk sending (> 20 recipients, taking about 40 seconds per fax).
// Fax created
fax.Attachments.Add(@"C:\Test Attachments\Products.pdf", BoolType.False);
fax.Send();
RightFax has this concept of Library Documents, so what I thought we could do was to store a PDF document as a Library Document on the server and then reuse it, so there is no need to upload this PDF for n users.
I can create Library Documents without problems (I can retrieve them, etc.), but how do I add a PDF to this? (I have rights on the server.)
LibraryDocument doc2 = server.LibraryDocuments.Create();
doc2.Description = "Test Doc 1";
doc2.ID = "568"; // tried ints, everything!
doc2.IsPublishedForWeb = BoolType.True;
doc2.PageCount = 2;
doc2.Save();
Also, once I have created a fax, the API gives you an option to "StoreAsNewLibraryDocument", which throws an exception when run: System.ArgumentException: Value does not fall within the expected range.
fax.StoreAsNewLibraryDocument("PRODUCTS","the products");
What matters for us is how to send, say, 500 faxes in the most efficient way possible using the API through RFCOMAPILib. I think that if we can reuse the attached PDF, it would greatly improve performance. Clearly, sending a fax in 40 seconds is unacceptable when you have hundreds of recipients.
How do we send faxes with attachments in the most efficient mode through the API?
StoreAsNewLibraryDocument() is the only practical way to store LibraryDocuments using the RightFax COM API, but assuming you're not using a pre-existing LibraryDocument, you have to call the function immediately after sending the first fax, which will have a regular file (not LibraryDoc) attachment.
(Don't create a LibraryDoc object on the server yourself, as you do above - you'd only do that if you have an existing file on the server that isn't a LibraryDocument, and you want to make it into one. You'll probably never encounter such a scenario.)
The new LibraryDocument is then referenced (in subsequent fax attachments) by the ID string you specify as the first argument of StoreAsNewLibraryDocument(). If that ID isn't unique to the RightFax Server's LibraryDocuments collection, you'll get an error. (You could use StoreAsLibraryDocumentUpdate() instead, if you want to actually replace the file on the server.) Also, remember to always specify the AttachmentType.
In theory, this should be all you really have to do:
// First fax:
fax.Attachments.Add(@"C:\Test Attachments\Products.pdf", BoolType.False);
fax.Attachments.Item(1).AttachmentType = AttachmentType.aFile;
fax.Send();
fax.StoreAsNewLibraryDocument("PRODUCTS", "The Products");
server.LibraryDocuments("PRODUCTS").IsPublishedForWeb = BoolType.True;

// And for all subsequent faxes:
fax.Attachments.Add(server.LibraryDocuments("PRODUCTS"));
fax.Attachments.Item(1).AttachmentType = AttachmentType.aLibraryDocument;
fax.Send();
The reason I say "in theory" is because this doesn't always work. Sometimes when you call StoreAsNewLibraryDocument() you end up with a LibraryDoc with a PageCount of zero. This happens seemingly at random, and is probably due to a bug in RightFax, or possibly a server misconfiguration. So it's a very good idea to check for...
server.LibraryDocuments("PRODUCTS").PageCount == 0
...before you send any of the subsequent faxes, and if necessary retry until it works, or (if it won't) store the LibraryDoc some other way and give up on StoreAsNewLibraryDocument().
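In code, that guard might be as simple as the sketch below (the retry/fallback policy is up to you):

// After StoreAsNewLibraryDocument("PRODUCTS", ...), verify the stored document
// actually has pages before reusing it for the remaining faxes.
if (server.LibraryDocuments("PRODUCTS").PageCount == 0)
{
    // Store the document again under a fresh ID, or fall back to attaching the
    // local PDF for this batch, rather than sending hundreds of zero-page faxes.
}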
Whereas, if you don't have that problem, you can usually send a mass fax in about a tenth of the time it takes when you attach (and upload) the local file each time.
If someone from OpenText/RightFax reads this and can explain why StoreAsNewLibraryDocument() sometimes results in zero-page faxes, an additional answer about that would be appreciated quite a bit!
I'm trying to parse through e-mails in Outlook 2007. I need to streamline it as fast as possible and seem to be having some trouble.
Basically it's:
foreach (Folder fld in outlookApp.Session.Folders)
{
    foreach (MailItem mailItem in fld.Items)
    {
        string body = mailItem.Body;
    }
}
and for 5000 e-mails, this takes over 100 seconds. It doesn't seem to me like this should be taking anywhere near this long.
If I add:
string entry = mailItem.EntryID;
It ends up being an extra 30 seconds.
I'm doing all sorts of string manipulations including regular expressions with these strings and writing out to database and still, those 2 lines take 50% of my runtime.
I'm using Visual Studio 2008
Doing this kind of thing will take a long time, as you have to pull the data from the Exchange store for each item.
I think that you have a couple of options here:
Process this information out of band using CDO/RDO in some other process.
Or
Use MAPI tables, as this is the fastest way to get properties. There are caveats with this, though you may be doing things in your processing that can be brought into a table.
Redemption wrapper - http://www.dimastr.com/redemption/mapitable.htm
MAPI Tables http://msdn.microsoft.com/en-us/library/cc842056.aspx
I do not know if this will address your specific issue, but the latest Office 2007 service pack made a significant performance difference (improvement) for Outlook with large numbers of messages.
Are you just reading in those strings in this loop, or are you reading in a string, processing it, then moving on to the next? You could try reading all the messages into a HashTable inside your loop then process them after they've been loaded--it might buy you some gains.
Any kind of UI updates are extremely expensive; if you're writing out text or incrementing a progress bar it's best to do so sparingly.
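As a rough sketch of that batching idea, using the same objects as the code in the question (a Dictionary standing in for the HashTable, and assuming every item in the folder really is a MailItem):

// Pull each Body (and EntryID) out of the COM objects once, into plain .NET strings,
// then do the regex work and database writes against the in-memory copies.
var bodies = new Dictionary<string, string>(); // EntryID -> Body

foreach (Folder fld in outlookApp.Session.Folders)
{
    foreach (MailItem mailItem in fld.Items)
    {
        bodies[mailItem.EntryID] = mailItem.Body;
    }
}

foreach (var pair in bodies)
{
    // ...regular expressions, database writes, and only occasional (not per-item) UI updates...
}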
We had exactly the same problem even when the folders were local and there was no network delay.
We got 10x speedup by storing a copy of every email in a local Sql Server CE table tuned for the search we needed. We also used update events to make sure the local database remains in sync with the Outlook/Exchange folders.
To totally eliminate user lag we took the search out of the Outlook thread and put it in its own thread. The perception of lagging was worse than the actual delay it seems.
I encountered a similar situation while trying to access Outlook mails via VBA (in Excel).
However, it was far slower in my case: 1 e-mail per second! (Maybe it was slower for me than in your case because I had implemented it in VBA.)
Anyway, I successfully managed to improve the speed by using SetColumns (e.g. https://learn.microsoft.com/en-us/office/vba/api/Outlook.Items.SetColumns).
I know, I know, this only works for a few properties, like "Subject" and "ReceivedTime", and not for the body!
But think again: do you really want to read through the body of all your emails? Or is it just a subset, maybe based on its 'Subject' line or 'ReceivedTime'?
My requirement was to just go into the body of the email in case its subject matched a specific string!
Hence, I did the below:
I added a second 'Outlook.Items' object called 'myFilterItemCopyForBody' and applied the same filter I had on the other 'Outlook.Items'.
So now I have two 'Outlook.Items' objects, 'myFilterItem' and 'myFilterItemCopyForBody', both with the same e-mail items, since the same Restrict conditions are applied to both.
'myFilterItem' - holds only the 'Subject' and 'ReceivedTime' properties of the relevant mails (done by using SetColumns)
'myFilterItemCopyForBody' - holds all the properties of the mail (including Body)
Now, both 'myFilterItem' and 'myFilterItemCopyForBody' are sorted by 'ReceivedTime' to keep them in the same order.
Once sorted, both are looped over simultaneously in nested For Each loops, and the corresponding properties are picked (with the help of a counter), as in the code below.
Dim myFilterItem As Outlook.Items
Dim myFilterItemCopyForBody As Outlook.Items
Dim myItems As Outlook.Items

Set myItems = olFldr.Items
Set myFilterItemCopyForBody = myItems.Restrict("@SQL=""urn:schemas:httpmail:datereceived"" > '" & startTime & "' AND ""urn:schemas:httpmail:datereceived"" < '" & endTime & "'")
Set myFilterItem = myItems.Restrict("@SQL=""urn:schemas:httpmail:datereceived"" > '" & startTime & "' AND ""urn:schemas:httpmail:datereceived"" < '" & endTime & "'")

myFilterItemCopyForBody.Sort ("ReceivedTime")
myFilterItem.Sort ("ReceivedTime")
myFilterItem.SetColumns ("Subject, ReceivedTime")

For Each myItem1 In myFilterItem
    iCount = iCount + 1
    For Each myItem2 In myFilterItemCopyForBody
        jCount = jCount + 1
        If iCount = jCount Then
            'Display myItem2.Body if myItem1.Subject contains a specific string
            'MsgBox myItem2.Body
            jCount = 0
            Exit For
        End If
    Next myItem2
Next myItem1
Note1: Notice that the Body property is accessed using the 'myItem2' corresponding to 'myFilterItemCopyForBody'.
Note2: The fewer times the code has to enter the inner loop to access the Body property, the better! You can further improve the efficiency by playing with the Restrict filter and the logic to reduce the number of times the code has to loop.
Hope this helps, even though this is not something new!