Adding entities to Stanford NLP NER Classifier

I have a very simple method to extract names, organisations and locations from a string. I am using the .NET NuGet libraries for Stanford NLP. It looks like this:
    CRFClassifier Classifier = CRFClassifier.getClassifier(StanfordNLPConfig.NER.ClassifierModel);
    List<IndexViewModel> ivms = new List<IndexViewModel>();
    try
    {
        foreach (List sentence in Classifier.classify(content).toArray())
        {
            NLPTranslator translator = new NLPTranslator();
            ivms.AddRange(translator.NERTranslate(sentence));
        }
    }
    catch (Exception)
    {
        // Rethrow without resetting the stack trace
        throw;
    }
    return ivms;
The model is the 3-class one it came with: english.all.3class.distsim.crf.ser.gz.
This is working really well for me, but what I'd like to do is interface with the model so that I can add my own entities when I need to. It seems very American-centric, and I'd like to be able to add my own UK companies, locations, etc.
Is there any way I can just add these entities? I have been reading about training, and it seems you possibly can't extend the existing model. If that is the case, can I combine classifiers and run the text through a UK one, a US one, and so on? If that's possible, how can I make my own classifier? I would like to build these in .NET if possible.

If you want to change the way the model works you'll need to provide training data and train your own model. For licensing reasons Stanford doesn't distribute the data the public models are trained on, but you can use the same features.
You can read about how to train a model here, though unfortunately the first sentence of the instructions is this:
The documentation for training your own classifier is somewhere between bad and non-existent.
If this is your first time working with a CRF there will be many things to learn, but it is manageable. It may be helpful to look at the documentation for other packages such as CRFSuite and CRF++; they generally all use basically the same training data format and are similar in many ways.
Also note that the existing models cannot be extended by training them on new input alone; the system just isn't set up that way.
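For a concrete picture of the training data, here is a minimal sketch of writing hand-labelled sentences in the two-column layout these CRF tools generally read: token, tab, label, with O for tokens outside any entity and a blank line between sentences. `NerTrainingData` and `ToTsv` are hypothetical names, not part of the Stanford library, and the labels are assumed to follow the 3-class model's PERSON / ORGANIZATION / LOCATION set.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: turns hand-labelled sentences into two-column training text.
public static class NerTrainingData
{
    // Each sentence is a sequence of (token, label) pairs; "O" marks a token outside any entity.
    public static string ToTsv(IEnumerable<IEnumerable<(string Token, string Label)>> sentences)
    {
        var blocks = sentences.Select(sentence =>
            string.Join("\n", sentence.Select(t => t.Token + "\t" + t.Label)));

        // A blank line separates one sentence from the next.
        return string.Join("\n\n", blocks) + "\n";
    }
}
```

The resulting text is what a training run would consume. Feature settings and the column mapping are configured on the Stanford side and are not shown here.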

Related

Factory Pattern implementation coupled with reading and writing

I'm trying to design an application the right way. It should:
- Read invoice data from SQL Server (2 queries depending on the type of invoice: sales or purchase)
- Process it (Acme may need fewer fields than SugarCorp, and in a different format)
- Output txt or csv (that may change in the future)
I found the factory pattern helpful, so I prepared a UML diagram for my case.
Each InvoiceFactoryProvider can generate a PInvoice or an SInvoice (specific to them). CreatePInvoice() and CreateSInvoice() should call the load() and save() methods.
How do I couple load() with the SQLReader class to get each row as a PInvoice object, and save() with my IDataWriter interface? Could you provide some example or advice?
Edit:
After reviewing examples of the Bridge pattern, as Atul suggested, I created a class diagram for this problem using it, which looks like this:
The invoice SQL queries may vary (the application may load invoice data from different systems: PollosInvoice or StarInvoice), and so may the way they are processed (different implementations).
In this case I decoupled the abstraction (Invoice) from its implementation (exporting the invoice to certain software: AcmeExporter or SigmaExporter). AcmeExporter and SigmaExporter will set their fields according to specification (date of transaction, payment method, invoice type, etc.), taken from the Invoice's DataTable. ExportInvoice() will return a DataTable with the needed data. InvoiceExporter also uses two interfaces, for encoding and file format.
What do you think about it? What flaws or advantages does it have?

Currently it looks like you are using the Abstract Factory design pattern to create your products (invoices). The point to note is that your load and save methods are inside the product (invoice), so in that case it is better to go with the Bridge design pattern. Your product will use the implementations of Reader and Writer to load and save the records.
Note: You would still be able to use Abstract Factory with this design pattern.
It will look something like below... (just an analogy)
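As a rough sketch of that idea in code: the invoice side holds references to a reader and a writer and never knows what is behind them. Every name here (`IInvoiceReader`, `IInvoiceWriter`, `InvoiceBatch`, `AcmeInvoiceBatch`, `ListReader`, `ListWriter`) is made up for illustration, and the in-memory reader and writer only stand in for the SQL query and the txt/csv output.

```csharp
using System;
using System.Collections.Generic;

// Implementor side of the bridge: how rows are read and how lines are written.
public interface IInvoiceReader { IEnumerable<string[]> ReadRows(); }
public interface IInvoiceWriter { void Write(IEnumerable<string> lines); }

// Abstraction side: an invoice batch that only knows the two interfaces.
public abstract class InvoiceBatch
{
    private readonly IInvoiceReader reader;
    private readonly IInvoiceWriter writer;

    protected InvoiceBatch(IInvoiceReader reader, IInvoiceWriter writer)
    {
        this.reader = reader;
        this.writer = writer;
    }

    // Each target format decides which fields it keeps and how it prints them.
    protected abstract string Format(string[] row);

    public void Export()
    {
        var lines = new List<string>();
        foreach (var row in reader.ReadRows())
            lines.Add(Format(row));
        writer.Write(lines);
    }
}

public sealed class AcmeInvoiceBatch : InvoiceBatch
{
    public AcmeInvoiceBatch(IInvoiceReader reader, IInvoiceWriter writer) : base(reader, writer) { }

    // Assumption: Acme only needs the first two columns, separated by ';'.
    protected override string Format(string[] row) => row[0] + ";" + row[1];
}

// In-memory stand-ins; a real reader would wrap the SQL query, a real writer a file.
public sealed class ListReader : IInvoiceReader
{
    private readonly List<string[]> rows;
    public ListReader(List<string[]> rows) { this.rows = rows; }
    public IEnumerable<string[]> ReadRows() => rows;
}

public sealed class ListWriter : IInvoiceWriter
{
    public List<string> Lines { get; } = new List<string>();
    public void Write(IEnumerable<string> lines) => Lines.AddRange(lines);
}
```

Swapping the SQL source or the output format then means passing a different reader or writer to the constructor. The invoice classes themselves stay untouched.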

Custom editortemplate when property has some attribute on it

This is what the implementation would look like:
    public class Product
    {
        public int id { get; set; }
        [MultiLangual]
        public string name { get; set; }
    }
In the database, name would contain something like:
{en:Potato, nl: Aardappel, de: Patat, fr: pommes de terre}
This would contain all the translated fields that a client has given to his own product
(in this case: a potato).
In the frontend, this would appear as multiple HTML elements, and on submitting the form I would (somehow) detect which language each one is.
My question is, how would I do this? I always get stuck on creating the attribute and don't know where to go from there...
In my attribute, I shouldn't need to do a lot, just something like this (I think):
    public class MultiLangualAttribute : Attribute
    {
        public MultiLangualAttribute() : base()
        {
        }
        public override string ToString()
        {
            return base.ToString();
        }
    }
But how would I detect everything in my views and create a custom layout for it (this should work with and .
It would only contain text.
Any ideas, or a better implementation of the above, would be VERY useful :)

I think the better (arguably) implementation is the standard way of localizing an application.
You define your resources and strings under an App_GlobalResources folder, which you will have to create.
For example, you create a file Fruits.resx with all the fruits you want to translate, in your system language.
Afterwards you create Fruits.de.resx, Fruits.es.resx, etc., one for each language you want on your website.
It is also possible to update the resources at runtime.
It's too much to describe the whole approach in this answer, so I would rather provide a link or two to detailed tutorials on MVC application localization.
This is the classic ASP.NET MVC localization explanation:
Globalization And Localization With Razor Web Pages
Another explanation of the same thing, a little more detailed, is here:
ASP.NET MVC Localization: Generate resource files and localized views using custom templates
This should be enough for you to localize your app the standard way.
This is a slightly more advanced approach, which uses the language as part of the URL you are accessing.
es.yourdomain.com will be in Spanish, fr.yourdomain.com will be in French:
Localization in ASP.NET MVC – 3 Days Investigation, 1 Day Job
With regard to your approach (storing the different languages in the database), here's a link to Microsoft's approach for this. It's much more involved and complex, and I am not sure the complexity would benefit you, since you end up using the database to fetch every single string in your app. It is not the most efficient approach, but it is possible as well:
Extending the ASP.NET Resource-Provider Model
Hope this all helps, and good luck.
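If you do stay with the marker attribute from the question, detecting it is plain reflection. Below is a minimal sketch, assuming the attribute only needs to flag a property. `MultiLangualHelper` and `MarkedProperties` are hypothetical names, and `Product` is redeclared with conventional casing so the example is self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Marker attribute: carries no data, it only flags a property as multi-language.
[AttributeUsage(AttributeTargets.Property)]
public sealed class MultiLangualAttribute : Attribute { }

public class Product
{
    public int Id { get; set; }

    [MultiLangual]
    public string Name { get; set; }
}

public static class MultiLangualHelper
{
    // Returns the names of the properties on T that carry the marker attribute.
    public static List<string> MarkedProperties<T>() =>
        typeof(T).GetProperties()
                 .Where(p => p.IsDefined(typeof(MultiLangualAttribute), true))
                 .Select(p => p.Name)
                 .ToList();
}
```

A view or editor template could run the same check to decide when to render one input per language. In ASP.NET MVC, putting `[UIHint("SomeTemplateName")]` on the property is another way to route it to a custom editor template without writing the reflection yourself.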

Which design patterns are useful to do this?

I have a device that is programmed at a low level. I give a version number to every new device and upgrade. I also have a program that communicates with these devices (to retrieve information from them).
For example, v1.2 sends this kind of string:
v1.2|Time|Conductivity|Repetation|Time|Heat of First Nozzle|Pressure|EndOfMessage
but the new version of the device program sends:
v1.3|Time|Conductivity|Repetation|Time|Humadity|1st Nozzle Heat;2nd Nozzle Heat|Pressure|EndOfMessage
My test application will retrieve information and change the operation of the device. Some operations will exist on a v1.2 device and some will not. I thought the Strategy design pattern would be useful for this situation, but I'm not sure. Which design pattern should I use?

Yes, this would be a good use case for the Strategy pattern, although you will also use the Factory pattern to create a specific parser instance.
Your code should then generally look something like this:
    public DeviceInfo Parse(InputData input)
    {
        var version = versionParser.Parse(input);
        var concreteParser = parserFactory.CreateFor(version);
        // Parse the original input with the version-specific parser
        var data = concreteParser.Parse(input);
        return data;
    }
For a simple project with few parsers, you may hardcode your parser factory:
    public class ParserFactory
    {
        public static IParser<DeviceInfo> CreateFor(Version version)
        {
            // instantiate proper parser based on version
        }
    }
Depending on the size of your project, you may also decide to use a plugin pattern for your parsers (System.AddIn contains useful classes for managing plugins).
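To make the hardcoded variant concrete, here is a minimal sketch for the two message layouts in the question. The field positions are read off the sample strings. The assumption that the fields are numeric, and the names `IDeviceParser`, `V12Parser`, `V13Parser`, `Field` and `DeviceMessageReader`, are mine for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public sealed class DeviceInfo
{
    public string Version { get; set; }
    public double Conductivity { get; set; }
    public double? Humidity { get; set; }   // only present in the v1.3 layout
    public List<double> NozzleHeats { get; } = new List<double>();
    public double Pressure { get; set; }
}

// Strategy: one implementation per message layout.
public interface IDeviceParser
{
    DeviceInfo Parse(string[] fields);
}

public static class Field
{
    // Assumes numeric fields use '.' as the decimal separator.
    public static double Num(string s) => double.Parse(s, CultureInfo.InvariantCulture);
}

public sealed class V12Parser : IDeviceParser
{
    public DeviceInfo Parse(string[] f)
    {
        var info = new DeviceInfo { Version = f[0], Conductivity = Field.Num(f[2]), Pressure = Field.Num(f[6]) };
        info.NozzleHeats.Add(Field.Num(f[5]));
        return info;
    }
}

public sealed class V13Parser : IDeviceParser
{
    public DeviceInfo Parse(string[] f)
    {
        var info = new DeviceInfo
        {
            Version = f[0],
            Conductivity = Field.Num(f[2]),
            Humidity = Field.Num(f[5]),
            Pressure = Field.Num(f[7])
        };
        // v1.3 packs both nozzle heats into one field, separated by ';'.
        foreach (var heat in f[6].Split(';'))
            info.NozzleHeats.Add(Field.Num(heat));
        return info;
    }
}

// Factory: picks the strategy from the version prefix of the message.
public static class DeviceMessageReader
{
    private static readonly Dictionary<string, Func<IDeviceParser>> Parsers =
        new Dictionary<string, Func<IDeviceParser>>
        {
            ["v1.2"] = () => new V12Parser(),
            ["v1.3"] = () => new V13Parser(),
        };

    public static DeviceInfo Read(string message)
    {
        var fields = message.Split('|');
        if (!Parsers.TryGetValue(fields[0], out var create))
            throw new NotSupportedException("Unknown device version: " + fields[0]);
        return create().Parse(fields);
    }
}
```

Supporting a future v1.4 would then mean adding one parser class and one dictionary entry. The calling code only ever sees `DeviceInfo`.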

I feel Strategy along with Factory Method will serve the purpose.

How to handle multiple object types when creating a new Type

Been tasked to write some asset tracking software...
Want to try to do this the right way. So I thought that a lot of assets had common fields.
For instance, a computer has a model and a manufacturer which a mobile phone also has.
I would want to store computers, monitors, mobile phones, etc. So I thought the common stuff can be taken into account using an abstract base class. The other properties that do not relate to one another would be stored in the actual class itself.
For instance,
    public abstract class Asset
    {
        // Auto-property; no separate backing field is needed
        public string Manufacturer { get; set; }
        //more common fields
    }
    public class Computer : Asset
    {
        public string OS { get; set; }
        //more fields pertinent to a PC, but inherit those public properties of Asset base
    }
    public class Phone : Asset
    {
        //etc etc
    }
But I have 2 concerns:
1) If I have a web form asking someone to add an asset, I want to give them, say, a radio button selection of the type of asset they are creating. Something to the effect of:
What are you creating
[]computer
[]phone
[]monitor
[OK] [CANCEL]
And they would select one, but I don't want to end up with code like this:
pseudocode:
    switch (RadioButtonControl.Text)
    {
        case "Computer":
            Computer c = new Computer(args);
            break;
        case "Phone":
            Phone p = new Phone(args);
            break;
        // ....
    }
This could get ugly....
Problem 2) I want to store this information in one database table with a TypeID field, so that when an insert into the database is done, this value becomes the type ID of the row (it distinguishes whether the row is a computer, a monitor, a phone, etc.). Should this TypeID field be declared inside the abstract base class as some sort of enum?
Thanks

My advice is to avoid this general design altogether. Don't use inheritance at all. Object orientation works well when different types of objects have different behavior. For asset tracking, none of the objects really has any behavior at all -- you're storing relatively "dumb" data, none of which does (or should) really do anything at all.
Right now, you seem to be approaching this as an object oriented program with a database as a backing store (so to speak). I'd reverse that: it's a database with a front-end that is (or at least might be) object oriented.
Then again, unless you have some really specific and unusual needs in your asset tracking, chances are that you shouldn't do this at all. There are literally dozens of perfectly reasonable asset tracking packages already on the market. Unless your needs really are pretty unusual, reinventing this particular wheel won't accomplish much.
Edit: I don't intend to advise against using OOP within the application itself at all. Quite the contrary, MVC (for example) works quite well, and I'd almost certainly use it for almost any kind of task like this.
Where I'd avoid OOP would be in the design of the data being stored. Here, you benefit far more from using something like an SQL-based database via something like OLE DB, ODBC, or JDBC.
Using a semi-standard component for this will give you things like scalability and incremental backup nearly automatically, and is likely to make future requirements (e.g. integration with other systems) considerably easier, as you'll have a standardized, well understood layer for access to the data.
Edit2: As far as when to use (or not use) inheritance, one hint (though I'll admit it's no more than that) is to look at behaviors, and whether the hierarchy you're considering really reflects behaviors that are important to your program. In some cases, the data you work with are relatively "active" in the program -- i.e. the behavior of the program itself revolves around the behavior of the data. In such a case, it makes sense (or at least can make sense) to have a relatively tight relationship between the data and the code.
In other cases, however, the behavior of the code is relatively unaffected by the data. I would posit that asset tracking is such a case. To the asset tracking program, it doesn't make much (if any) real difference whether the current item is a telephone, or a radio, or a car. There are a few (usually much broader) classes you might want to take into account -- at least for quite a few businesses, it matters whether assets are considered "real estate", "equipment", "office supplies", etc. These classifications lead to differences in things like how the asset has to be tracked, taxes that have to be paid on it, and so on.
At the same time, two items that fall under office supplies (e.g. paper clips and staples) don't have significantly different behaviors -- each has a description, cost, location, etc. Depending on what you're trying to accomplish, each might have things like a trigger when the quantity falls below a certain level, to let somebody know that it's time to re-order.
One way to summarize that might be to think in terms of whether the program can reasonably work with data for which it wasn't really designed. For asset tracking, there's virtually no chance that you can (or would want to) create a class for every kind of object somebody might decide to track. You need to plan from the beginning on the fact that it's going to be used for all kinds of data you didn't explicitly account for in the original design. Chances are that for the majority of items, you need to design your code to be able to just pass data through, without knowing (or caring) much about most of the content.
Modeling the data in your code makes sense primarily when/if the program really needs to know about the exact properties of the data, and can't reasonably function without it.
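To illustrate what "just pass data through" might look like in C#, here is a minimal sketch in which the asset type is a value looked up from data rather than a class. `AssetRecord`, `AssetTypes` and the type IDs are hypothetical, and the hardcoded dictionary stands in for a lookup table in the database.

```csharp
using System;
using System.Collections.Generic;

// One shape for every asset: the type is stored as data, not as a subclass.
public sealed class AssetRecord
{
    public int TypeId { get; set; }   // key into an asset-type lookup table
    public string Manufacturer { get; set; }
    public string Model { get; set; }

    // Type-specific values ("OS", "ScreenSize", ...) that the program only passes through.
    public Dictionary<string, string> Attributes { get; } =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
}

public static class AssetTypes
{
    // Assumption: in practice this is loaded from the lookup table,
    // so adding a new asset type needs no code change.
    public static readonly Dictionary<string, int> ByName =
        new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase)
        {
            ["Computer"] = 1,
            ["Phone"] = 2,
            ["Monitor"] = 3,
        };

    public static AssetRecord Create(string typeName, string manufacturer, string model)
    {
        if (!ByName.TryGetValue(typeName, out var typeId))
            throw new ArgumentException("Unknown asset type: " + typeName, nameof(typeName));
        return new AssetRecord { TypeId = typeId, Manufacturer = manufacturer, Model = model };
    }
}
```

With this shape, the radio button text goes straight into `Create`, so there is no switch statement to grow. The same type ID is also what ends up in the table's TypeID column.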

Report handler architecture question

I am attempting to have a ReportHandler service to handle report creation. Reports can have differing numbers of parameters that can be set. In the system currently there are several different methods of creating reports (MS Reporting Services, HTML reports, etc.) and the way the data is generated for each report is different. I am trying to consolidate everything into ActiveReports. I can't alter the system and change the parameters, so in some cases I will essentially get a where clause to generate the results, and in other cases I will get key/value pairs that I must use to generate the results. I thought about using the factory pattern, but because of the different number of query filters this won't work.
I would love to have a single ReportHandler that would take my varied inputs and spit out a report. At this point I'm not seeing any other way than to use a big switch statement to handle each report based on the reportName. Any suggestions on how I could solve this better?

From your description, if you're looking for a pattern that matches better than Factory, try Strategy:
Strategy Pattern
Your context could be a custom class which encapsulates and abstracts the different report inputs (you could use the Abstract Factory pattern for this part).
Your strategy could implement any number of different query filters or additional logic needed. And if you ever need to change the system in the future, you can switch between report tools by simply creating a new strategy.
Hope that helps!
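A minimal sketch of how that could replace the big switch, assuming the two input kinds from the question (a ready-made where clause, or key/value pairs). `ReportRequest`, `IReportStrategy`, the two strategy classes and this `ReportHandler` are illustrative names only, not part of ActiveReports.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The context: one wrapper around whichever kind of input a report receives.
public sealed class ReportRequest
{
    public string ReportName { get; set; }
    public string WhereClause { get; set; }
    public Dictionary<string, string> Parameters { get; } = new Dictionary<string, string>();
}

public interface IReportStrategy
{
    string BuildFilter(ReportRequest request);
}

// For reports that already arrive with a where clause.
public sealed class WhereClauseStrategy : IReportStrategy
{
    public string BuildFilter(ReportRequest request) => request.WhereClause;
}

// For reports that arrive as key/value pairs: emits named placeholders
// so that the values can be bound as query parameters later.
public sealed class KeyValueStrategy : IReportStrategy
{
    public string BuildFilter(ReportRequest request) =>
        string.Join(" AND ", request.Parameters.Select(p => p.Key + " = @" + p.Key));
}

public sealed class ReportHandler
{
    private readonly Dictionary<string, IReportStrategy> strategies;

    public ReportHandler(Dictionary<string, IReportStrategy> strategies)
    {
        this.strategies = strategies;
    }

    // A dictionary lookup by report name takes the place of the switch statement.
    public string BuildFilter(ReportRequest request)
    {
        if (!strategies.TryGetValue(request.ReportName, out var strategy))
            throw new KeyNotFoundException("No strategy registered for " + request.ReportName);
        return strategy.BuildFilter(request);
    }
}
```

Registering a report is then a matter of adding one dictionary entry, which could come from configuration instead of code.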

In addition to the Strategy pattern, you can also create one adapter for each of your underlying solutions, then use Strategy to vary them. I've built something similar, with each report solution supported by what I called engines. In addition to the variable report solution we have a variable storage solution as well: output can be stored in SQL Server or on the file system.
I would suggest using a container, then initializing it with the correct engines, e.g.:
    public class ReportContainer
    {
        public ReportContainer(IReportEngine reportEngine, IStorageEngine storage, IDeliveryEngine delivery /* ... */)
        {
        }
    }

    // In your service layer you resolve which engines to use,
    // either with a bunch of if statements / Factory / config ...
    IReportEngine rptEngine = EngineFactory.GetEngine<IReportEngine>(/* pass in some values */);
    IStorageEngine stgEngine = EngineFactory.GetEngine<IStorageEngine>(/* pass in some values */);
    IDeliveryEngine delEngine = EngineFactory.GetEngine<IDeliveryEngine>(/* pass in some values */);
    ReportContainer currentContext = new ReportContainer(rptEngine, stgEngine, delEngine);
The ReportContainer then delegates work to the dependent engines...

We had a similar problem and went with the concept of "connectors" that are interfaces between the main report generator application and the different report engines. By doing this, we were able to create a "universal report server" application. You should check it out at www.versareports.com.
