Identify Phrases/Words in a Sentence - c#

I'm looking for an API or library that can identify certain phrases/words in a sentence or short phrase.
The application of this is to find items which are not permitted. For instance, we will have a definitive list of phrases/words that are not permitted, e.g. knife, battery, oil, paint, nail polish, glass, etc.
We will then have a list of short phrases that need to be checked against this list. Ideally the API should handle pluralisation, misspellings and number substitutions, e.g. 0 as o.
UPDATE 15 Oct 2021:
Over the past few months, I've been experimenting with Fuse (JavaScript) and FuzzySharp. However, I’m still struggling to find a solution that can accurately identify words/phrases.
Using FuzzySharp, some examples of false positives are:
“used hot water bottle” – this matched “water” with a score of 90
“waterproof trousers” – this matched “water” with a score of 90
“oil” – this matched “toiletries” with a score of 90
“toilet” – this matched “toiletries” with a score of 90
I understand why these have a high score; however, I'm unsure what technology I should be using to improve accuracy. At the moment we only have 298 phrases that we want to search against.
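One way to attack these false positives (a minimal sketch, not a drop-in solution: the class name, the banned list subset, the normalisation rules and the one-edit-per-four-characters threshold are all illustrative assumptions) is to match whole tokens against the banned list instead of scoring substrings, so "toiletries" can no longer match "toilet", while a length-scaled Levenshtein distance still tolerates misspellings:

using System;
using System.Linq;

static class BannedPhraseMatcher
{
    // Illustrative subset of the banned list from the question.
    static readonly string[] Banned =
        { "knife", "battery", "oil", "paint", "glass", "water", "toiletries" };

    public static bool ContainsBannedWord(string phrase)
    {
        var tokens = phrase.ToLowerInvariant()
                           .Split(' ', ',', '.', '-')
                           .Where(t => t.Length > 0)
                           .Select(Normalize);
        return tokens.Any(t => Banned.Any(b => IsCloseMatch(t, b)));
    }

    static string Normalize(string token)
    {
        token = token.Replace('0', 'o');            // number substitution: 0 as o
        if (token.EndsWith("s") && token.Length > 3)
            token = token[..^1];                    // crude de-pluralisation
        return token;
    }

    // Allow roughly one edit per four characters of the banned word, so short
    // words like "oil" must match exactly.
    static bool IsCloseMatch(string token, string banned) =>
        Levenshtein(token, banned) <= banned.Length / 4;

    static int Levenshtein(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return d[a.Length, b.Length];
    }

    static void Main()
    {
        Console.WriteLine(ContainsBannedWord("waterproof trousers")); // False: "waterproof" is 5 edits from "water"
        Console.WriteLine(ContainsBannedWord("toilet paper"));        // False: "toilet" is 4 edits from "toiletries"
        Console.WriteLine(ContainsBannedWord("kitchen knifes"));      // True: plural stripped, matches "knife"
        Console.WriteLine(ContainsBannedWord("b0ttle of 0il"));       // True: "0il" normalises to "oil"
    }
}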

Find possible combinations

I am trying to make a function that returns all possible combinations of a given set of data.
It's for a hotel yield management problem.
There are 3 factors to consider:
5 rooms available
18 incoming booking requests
7 days of booking
If a room is booked for a given day, that room cannot be booked further until it is free the next day.
I believe this might be a simple math problem; however, math is not my strong suit, so I'm asking for help finding a way to create all the possible booking combinations.
Best regards
Edit:
As requested, here are some additional details:
The goal is to find the highest possible revenue.
All of the 18 requests come on different days, different stay length and room rates, but there are only 5 rooms in this "hotel".
So what I want is to find out how many booking combinations can be made in 7 days with 5 rooms and 18 requests.
Then later I would go through every combination and tally the price to find the best one (exhaustive search).
Does that help?
In your case I would use brute force, because you have a fairly limited number of combinations; it would be n^3 complexity.
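To make the brute force concrete, here is a minimal exhaustive-search sketch in C#. The data shapes are assumptions (each request is an arrival day, a stay length in days, and a total rate; the three sample requests are placeholders for the real 18): for every request, either reject it or put it in any room that is free for the whole stay, and keep the best total revenue found.

using System;

class BookingSearch
{
    const int Rooms = 5, Days = 7;

    record Request(int ArrivalDay, int Length, decimal Rate);

    static Request[] requests =
    {
        new(0, 2, 100m), new(1, 3, 180m), new(4, 2, 120m),
        // ... the remaining requests (18 in total) go here
    };

    static bool[,] occupied = new bool[Rooms, Days];
    static decimal best;

    static void Search(int index, decimal revenue)
    {
        if (index == requests.Length)
        {
            best = Math.Max(best, revenue);
            return;
        }

        var r = requests[index];

        // Option 1: reject this request.
        Search(index + 1, revenue);

        // Option 2: try each room that is free for the whole stay
        // (stays running past the 7-day horizon are treated as unbookable).
        for (int room = 0; room < Rooms; room++)
        {
            bool free = true;
            for (int d = r.ArrivalDay; d < r.ArrivalDay + r.Length && free; d++)
                free = d < Days && !occupied[room, d];
            if (!free) continue;

            for (int d = r.ArrivalDay; d < r.ArrivalDay + r.Length; d++)
                occupied[room, d] = true;
            Search(index + 1, revenue + r.Rate);
            for (int d = r.ArrivalDay; d < r.ArrivalDay + r.Length; d++)
                occupied[room, d] = false;          // backtrack
        }
    }

    static void Main()
    {
        Search(0, 0m);
        Console.WriteLine($"Best revenue: {best}");
    }
}

Since the rooms are identical, this tries equivalent assignments more than once; at 5 rooms and 18 requests that is still cheap, and the search stays exact.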

Complicated Logic for Variation of same rhythm C#

I am working on checking the variation in a stream of integers and filtering them. Say we start entering integers such as 700, 768, 820, 320, 790, 260, etc. I want to check whether each integer is within the same rhythm and harmony, so 320 will be removed or ignored.
Let us say the variation must not be less than 75% of the lower number, and must not be more than 75% above it.
Actually the problem is not checking whether each new entry is higher or lower; the problem is when the rhythm or harmony of the entries suddenly becomes higher. Let's take an example:
768, 799, 890, 320, 380, 799, 820, 1230, 1300, 1340, 1342, 1400, 680, 1340, 1280, 1490
In this case we started in a range of 700 ~ 890, so 320 is filtered out; then the range suddenly becomes 1230 ~ 1400, so 680 is filtered out. And we cannot predict in advance what the range will be.
So how do I make logic that can filter entries out and maintain the upper and lower limits?
No need for any code; I just need a logical explanation.
Regards...
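Since only the logic was asked for, here is one possible logical scheme, sketched in C# purely as an illustration (the window size, the ±25% tolerance standing in for the 75% rule, and the reset-after-3 heuristic are all assumptions): accept a value if it is close to the median of recently accepted values, and treat a run of mutually consistent rejections as a shift to a new range.

using System;
using System.Collections.Generic;
using System.Linq;

class RhythmFilter
{
    readonly Queue<int> window = new();       // recently accepted values
    readonly List<int> outliers = new();      // consecutive rejected values
    const int WindowSize = 5;
    const int ResetAfter = 3;                 // rejections before adopting a new range
    const double Tolerance = 0.25;            // illustrative stand-in for the 75% rule

    public bool Accept(int value)
    {
        if (window.Count == 0) { Push(value); return true; }

        double median = Median(window);
        bool inRange = value >= median * (1 - Tolerance)
                    && value <= median * (1 + Tolerance);

        if (inRange)
        {
            outliers.Clear();
            Push(value);
            return true;
        }

        // If several consecutive rejected values are consistent with each
        // other, assume the rhythm itself has shifted and reset the window.
        outliers.Add(value);
        if (outliers.Count >= ResetAfter && Spread(outliers) <= Tolerance)
        {
            window.Clear();
            foreach (int v in outliers) Push(v);
            outliers.Clear();
            return true;
        }
        return false;
    }

    void Push(int v)
    {
        window.Enqueue(v);
        if (window.Count > WindowSize) window.Dequeue();
    }

    static double Median(IEnumerable<int> xs)
    {
        var s = xs.OrderBy(x => x).ToArray();
        return s.Length % 2 == 1 ? s[s.Length / 2]
                                 : (s[s.Length / 2 - 1] + s[s.Length / 2]) / 2.0;
    }

    static double Spread(IReadOnlyList<int> xs) =>
        (xs.Max() - (double)xs.Min()) / xs.Max();

    static void Main()
    {
        int[] data = { 768, 799, 890, 320, 380, 799, 820,
                       1230, 1300, 1340, 1342, 1400, 680, 1340, 1280, 1490 };
        var f = new RhythmFilter();
        foreach (int v in data)
            Console.WriteLine($"{v,5} -> {(f.Accept(v) ? "kept" : "filtered")}");
    }
}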

Detecting rare incidents from multivariate time series intervals

Given a time series of sensor state intervals, how do I implement a classifier which learns from supervised training data to detect an incident based on a sequence of state intervals? To simplify the problem, sensor states are reduced to either true or false.
Update: I've found this paper (PDF) on Mining Sequences of Temporal Intervals which addresses a similar problem. Another paper (Google Docs) on Mining Hierarchical Temporal Patterns in Multivariate Time Series takes a novel approach, but deals with hierarchical data.
Example Training Data
The following data is a training example for an incident, represented as a graph over time, where /¯¯¯\ represents a true state interval and \___/ a false state interval for a sensor.
Sensor | Sensor State over time
| 0....5....10...15...20...25... // timestamp
---------|--------------------------------
A | ¯¯¯¯¯¯¯¯¯¯¯¯\________/¯¯¯¯¯¯¯¯
B | ¯¯¯¯¯\___________________/¯¯¯¯
C | ______________________________ // no state change
D | /¯\_/¯\_/¯\_/¯\_/¯\_/¯\_/¯\_/¯
E | _________________/¯¯¯¯¯¯¯¯\___
Incident Detection vs Sequence Labeling vs Classification
I initially generalised my problem as a two-category sequence labeling problem, but my categories really represented "normal operation" and a rare "alarm event" so I have rephrased my question as incident detection. Training data is available for "normal operation" and "alarm incident".
To reduce problem complexity, I have discretized sensor events to boolean values, but this need not be the case.
Possible Algorithms
A hidden Markov model seems to be a possible solution, but would it be able to use the state intervals? If a sequence labeler is not the best approach for this problem, alternative suggestions would be appreciated.
Bayesian Probabilistic Approach
Sensor activity will vary significantly by time of day (busy in mornings, quiet at night). My initial approach would have been to measure normal sensor state over a few days and calculate state probability by time of day (hour). The combined probability of sensor states at an unlikely hour surpassing an "unlikelihood threshold" would indicate an incident. But this seemed like it would raise a false alarm if the sensors were noisy. I have not yet implemented this, but I believe that approach has merit.
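A rough sketch of that idea (the class shape, the Laplace smoothing, and the threshold value are my assumptions; it also inherits the noisy-sensor weakness noted above): estimate P(sensor = true) per sensor per hour from normal days, then flag an hour whose joint state log-likelihood falls below the "unlikelihood threshold".

using System;

class HourlyBaseline
{
    // pTrue[sensor, hour] = estimated P(sensor == true) during that hour,
    // learned from "normal operation" days.
    readonly double[,] pTrue;
    readonly int sensors;

    public HourlyBaseline(bool[][,] trainingDays, int sensors)
    {
        this.sensors = sensors;
        pTrue = new double[sensors, 24];
        foreach (var day in trainingDays)               // day[sensor, hour]
            for (int s = 0; s < sensors; s++)
                for (int h = 0; h < 24; h++)
                    if (day[s, h]) pTrue[s, h] += 1.0;
        for (int s = 0; s < sensors; s++)
            for (int h = 0; h < 24; h++)                // Laplace smoothing so no
                pTrue[s, h] = (pTrue[s, h] + 1) / (trainingDays.Length + 2); // probability is ever 0
    }

    // Joint log-likelihood of the observed states at a given hour, assuming
    // the sensors are independent (a strong simplification).
    public double LogLikelihood(bool[] states, int hour)
    {
        double ll = 0;
        for (int s = 0; s < sensors; s++)
            ll += Math.Log(states[s] ? pTrue[s, hour] : 1 - pTrue[s, hour]);
        return ll;
    }

    // The threshold must be tuned to the number of sensors and the noise
    // level; -3.0 only suits this toy example.
    public bool IsIncident(bool[] states, int hour, double threshold = -3.0) =>
        LogLikelihood(states, hour) < threshold;

    static void Main()
    {
        // Two identical "normal" training days, 3 sensors: sensor 0 always true.
        var day = new bool[3, 24];
        for (int h = 0; h < 24; h++) day[0, h] = true;
        var baseline = new HourlyBaseline(new[] { day, day }, sensors: 3);

        Console.WriteLine(baseline.IsIncident(new[] { true, false, false }, 3));  // False: matches training
        Console.WriteLine(baseline.IsIncident(new[] { false, true, true }, 3));   // True: all three flipped
    }
}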
Feature Extraction
Vector states could be represented as state interval changes occurring at a specific time and lasting a specific duration.
// A state interval records that a sensor held a given state, starting at a
// timestamp and lasting for a duration.
struct StateInterval
{
    public int SensorID;
    public bool State;
    public DateTime TimeStamp;
    public TimeSpan Duration;
}
e.g. some state intervals from the process table:
[ {D, true, 0, 3} ]; [ {D, false, 4, 1} ]; ...
[ {A, true, 0, 12} ]; [ {B, true, 0, 6} ]; [ {D, true, 0, 3} ]; etc.
A good classifier would take into account state-value intervals and recent state changes to determine if a combination of state changes closely matches training data for a category.
Edit: Some ideas after sleeping on how to extract features from multiple sensors' alarm data and how to compare it to previous data...
Start by calculating the following data for each sensor for each hour of the day:
Average state interval length (for true and false states)
Average time between state changes
Number of state changes over time
Each sensor could then be compared to every other sensor in a matrix with data like the following:
Average time taken for sensor B to change to a true state after sensor A did. If an average value is 60 seconds, then a 1-second wait would be more interesting than a 120-second wait.
Average number of state changes sensor B underwent while sensor A was in one state
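A sketch of how the per-sensor, per-hour statistics could be pulled out of a list of StateIntervals with LINQ (the record shape and all names are illustrative); the pairwise sensor matrix would be built on top of the same grouping:

using System;
using System.Collections.Generic;
using System.Linq;

record StateInterval(int SensorID, bool State, DateTime TimeStamp, TimeSpan Duration);

static class SensorFeatures
{
    public static void Summarize(IEnumerable<StateInterval> intervals)
    {
        var stats =
            from iv in intervals
            group iv by new { iv.SensorID, iv.TimeStamp.Hour, iv.State } into g
            select new
            {
                g.Key.SensorID, g.Key.Hour, g.Key.State,
                AvgLength   = g.Average(iv => iv.Duration.TotalSeconds), // avg interval length
                ChangeCount = g.Count()          // state changes starting in this hour
            };

        foreach (var s in stats)
            Console.WriteLine($"sensor {s.SensorID} hour {s.Hour,2} " +
                              $"{(s.State ? "true " : "false")}: " +
                              $"avg {s.AvgLength:F0}s over {s.ChangeCount} intervals");
    }

    static void Main()
    {
        var t0 = new DateTime(2011, 1, 1);
        Summarize(new[]
        {
            new StateInterval(0, true,  t0,                TimeSpan.FromSeconds(12)),
            new StateInterval(0, false, t0.AddSeconds(12), TimeSpan.FromSeconds(8)),
            new StateInterval(1, true,  t0,                TimeSpan.FromSeconds(6)),
        });
    }
}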
Given two sets of training data, the classifier should be able to determine from these feature sets which is the most likely category for classification.
Is this a sensible approach and what would be a good algorithm to compare these features?
Edit: the direction of a state change (false->true vs true->false) is significant, so any features should take that into account.
A simple solution would be to collapse the time aspect of your data and take each timestamp as one instance. In this case, the values of the sensors are considered your feature vector, where each time step is labeled with a class value of category A or B (at least for the labeled training data):
sensors | class
A B C D E |
-------------------------
1 1 1 0 0 | catA
1 0 0 0 0 | catB
1 1 0 1 0 | catB
1 1 0 0 0 | catA
..
This input data is fed to the usual classification algorithms (ANN, SVM, ...), and the goal is to predict the class of unlabeled time series:
sensors | class
A B C D E |
-------------------------
0 1 1 1 1 | ?
1 1 0 0 0 | ?
..
An intermediary step of dimensionality reduction / feature extraction could improve the results.
Obviously this may not be as good as modeling the time dynamics of the sequences, especially since techniques such as Hidden Markov Models (HMM) take into account the transitions between the various states.
EDIT
Based on your comment below, it seems that the best way to get less transitory predictions of the target class is to apply a post-processing rule at the end of the prediction phase, treating the classification output as a sequence of consecutive predictions.
The way this works is that you compute the class posterior probabilities (i.e. the probability distribution that an instance belongs to each class label, which in the case of a binary SVM is easily derived from the decision function). Then, given a specified threshold, you check whether the probability of the predicted class is above that threshold: if it is, you use that class to predict the current timestamp; if not, you keep the previous prediction, and the same goes for future instances. This has the effect of adding a certain inertia to the current prediction.
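In code, that post-processing rule is just a threshold with memory. A minimal sketch (class and method names assumed, and the posteriors taken as a ready-made sequence of P(alarm) values):

using System;
using System.Collections.Generic;

static class PredictionSmoother
{
    // posteriors[t] = P(class == "alarm") at timestamp t, as produced by the
    // classifier. Only predictions whose confidence clears the threshold may
    // change the current label; otherwise the previous label is kept.
    public static List<string> Smooth(IEnumerable<double> posteriors,
                                      double threshold = 0.8)
    {
        var labels = new List<string>();
        string current = "normal";                  // assumed starting label
        foreach (double pAlarm in posteriors)
        {
            string predicted = pAlarm >= 0.5 ? "alarm" : "normal";
            double confidence = predicted == "alarm" ? pAlarm : 1 - pAlarm;
            if (confidence >= threshold)
                current = predicted;                // confident: switch labels
            labels.Add(current);                    // otherwise: inertia
        }
        return labels;
    }

    static void Main()
    {
        double[] posteriors = { 0.2, 0.55, 0.95, 0.6, 0.3, 0.1 };
        Console.WriteLine(string.Join(" ", Smooth(posteriors)));
        // -> normal normal alarm alarm alarm normal
    }
}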
This doesn't sound like a classification problem. Classifiers aren't really meant to take into account "a combination of state changes." It sounds like a sequence labeling problem. Look into using a Hidden Markov Model or a Conditional Random Field. You can find an efficient implementation of the latter at http://leon.bottou.org/projects/sgd.
Edit:
I've read through your question in a little more detail, and I don't think an HMM is the best model given what you want to do with features. It's going to blow up your state space and could make inference intractable. You need a more expressive model. You could look at Dynamic Bayesian Networks. They generalize HMMs by allowing the state space to be represented in factored form. Kevin Murphy's dissertation is the most thorough resource for them I've come across.
I still like CRFs, though. Just as an easy place to start, define one with the time of day and each of the sensor readings as the features for each observation, and use bigram feature functions. You can see how it performs and increase the complexity of your features from there. I would start simple, though. I think you're underestimating how difficult some of your ideas will be to implement.
Why reinvent the wheel? Check out TClass
If that doesn't cut it for you, you can also find a number of pointers there. I hope this helps.

Create a summary description of a schedule given a list of shifts

Assuming I have a list of shifts for an event (in the format start date/time, end date/time), is there some sort of algorithm I could use to create a generalized summary of the schedule? It is quite common for most of the shifts to fall into some sort of common recurrence pattern (i.e. Mondays from 9:00 am to 1:00 pm, Tuesdays from 10:00 am to 3:00 pm, etc.). However, there can (and will) be exceptions to this rule (e.g. one of the shifts fell on a holiday and was rescheduled for the next day). It would be fine to exclude those from my "summary", as I'm looking to provide a more general answer of when this event usually occurs.
I guess I'm looking for some sort of statistical method to determine the day and time occurrences and create a description based on the most frequent occurrences found in the list. Is there some sort of general algorithm for something like this? Has anyone created something similar?
Ideally I'm looking for a solution in C# or VB.NET, but don't mind porting from any other language.
Thanks in advance!
You may use Cluster Analysis.
Clustering is a way to segregate a set of data into similar components (subsets). The "similarity" concept involves some definition of "distance" between points. Many usual formulas for the distance exist, among them the usual Euclidean distance.
Practical Case
Before pointing you to the tricks of the trade, let's work through a practical case for your problem, so you can get a feel for the algorithms and packages, or discard them upfront.
For easiness, I modelled the problem in Mathematica, because Cluster Analysis is included in the software and very straightforward to set up.
First, generate the data. The format is { DAY, START TIME, END TIME }.
The start and end times have a random variable added (+half hour, zero, -half hour) to show the capability of the algorithm to cope with "noise".
There are three days, three shifts per day and one extra (the last one) "anomalous" shift, which starts at 7 AM and ends at 9 AM (poor guys!).
There are 150 events in each "normal" shift and only two in the exceptional one.
As you can see, some shifts are not very far apart from each other.
I include the code in Mathematica, in case you have access to the software. I'm trying to avoid using the functional syntax, to make the code easier to read for "foreigners".
Here is the data generation code:
Rn[] := 0.5 * RandomInteger[{-1, 1}];
monshft1 = Table[{1, 10 + Rn[], 15 + Rn[]}, {150}]; (* 1 *)
monshft2 = Table[{1, 12 + Rn[], 17 + Rn[]}, {150}]; (* 2 *)
wedshft1 = Table[{3, 10 + Rn[], 15 + Rn[]}, {150}]; (* 3 *)
wedshft2 = Table[{3, 14 + Rn[], 17 + Rn[]}, {150}]; (* 4 *)
frishft1 = Table[{5, 10 + Rn[], 15 + Rn[]}, {150}]; (* 5 *)
frishft2 = Table[{5, 11 + Rn[], 15 + Rn[]}, {150}]; (* 6 *)
monexcp  = Table[{1,  7 + Rn[],  9 + Rn[]}, {2}];   (* 7 *)
Now we join the data, obtaining one big dataset:
data = Join[monshft1, monshft2, wedshft1, wedshft2, frishft1, frishft2, monexcp];
Let's run a cluster analysis for the data:
clusters = FindClusters[data, 7, Method->{"Agglomerate","Linkage"->"Complete"}]
"Agglomerate" and "Linkage" -> "Complete" are two fine tuning options of the clustering methods implemented in Mathematica. They just specify we are trying to find very compact clusters.
I specified 7 as the number of clusters to detect. If the right number of shifts is unknown, you can try several reasonable values and compare the results, or let the algorithm select the most appropriate value.
We can get a chart with the results, each cluster in a different color (don't mind the code)
ListPointPlot3D[ clusters,
PlotStyle->{{PointSize[Large], Pink}, {PointSize[Large], Green},
{PointSize[Large], Yellow}, {PointSize[Large], Red},
{PointSize[Large], Black}, {PointSize[Large], Blue},
{PointSize[Large], Purple}, {PointSize[Large], Brown}},
AxesLabel -> {"DAY", "START TIME", "END TIME"}]
The result is a 3D scatter plot in which you can see our seven clusters clearly apart.
That solves part of your problem: identifying the data. Now you also want to be able to label it.
So, we'll get each cluster and take means (rounded):
Table[Round[Mean[clusters[[i]]]], {i, 7}]
The result is:
Day Start End
{"1", "10", "15"},
{"1", "12", "17"},
{"3", "10", "15"},
{"3", "14", "17"},
{"5", "10", "15"},
{"5", "11", "15"},
{"1", "7", "9"}
And with that you get again your seven classes.
Now, perhaps you want to classify the shifts regardless of the day: if the same people perform the same task at the same time every day, it's not useful to call it "Monday shift from 10 to 15" when it also happens on Wednesdays and Fridays (as in our example).
Let's analyze the data disregarding the first column:
clusters=
FindClusters[Take[data, All, -2],Method->{"Agglomerate","Linkage"->"Complete"}];
In this case, we are not selecting the number of clusters to retrieve, leaving the decision to the package.
The resulting plot shows that five clusters have been identified.
Let's try to "label" them as before:
Grid[Table[Round[Mean[clusters[[i]]]], {i, 5}]]
The result is:
START END
{"10", "15"},
{"12", "17"},
{"14", "17"},
{"11", "15"},
{ "7", "9"}
Which is exactly what we "suspected": there are repeated events each day at the same time that could be grouped together.
Edit: Overnight Shifts and Normalization
If you have (or plan to have) shifts that start one day and end on the following, it's better to model
{Start-Day Start-Hour Length} // Correct!
than
{Start-Day Start-Hour End-Day End-Hour} // Incorrect!
That's because, as with any statistical method, correlation between the variables must be made explicit, or the method fails miserably. The guiding principle is something like "keep your candidate data normalized": the attributes should be independent.
--- Edit end ---
By now I guess you understand pretty well what kind of things you can do with this kind of analysis.
Some references
Of course, Wikipedia, its "references" and "further reading" are good guide.
A nice video here shows the capabilities of Statsoft; you can also get many ideas there about other things you can do with the algorithm.
Here is a basic explanation of the algorithms involved
Here you can find the impressive functionality of R for Cluster Analysis (R is a VERY good option)
Finally, here you can find a long list of free and commercial software for statistics in general, including clustering.
HTH!
I don't think any ready-made algorithm exists, so unfortunately you need to come up with something yourself. Because the problem is not really well defined from a mathematical perspective, it will require testing on some "real" data that is reasonably representative, and a fair bit of tweaking.
I would start by dividing your shifts into weekdays (because, if I understand correctly, you are after a weekly view), so for each weekday we have the shifts that fall on that day. Then, for each day, I would group the shifts that happen at roughly the same time; here you need some heuristic, e.g. both start and end times deviate from the group average by no more than 15 or 30 minutes. Next we need another heuristic to decide whether a group is relevant: a 1pm-3pm shift that happened on only one Monday is probably not relevant, but if it happened on at least 70% of the Mondays covered by the data, then it is. The relevant groups for each day of the week then form the schedule you are after.
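A sketch of that procedure in C# (the 30-minute tolerance, the 70% coverage rule, and all names are illustrative; the week count assumes at most one shift per group per day):

using System;
using System.Collections.Generic;
using System.Linq;

record Shift(DateTime Start, DateTime End);

static class ScheduleSummary
{
    const double ToleranceMinutes = 30;
    const double MinCoverage = 0.7;   // group must appear on 70% of that weekday

    public static void Summarize(List<Shift> shifts)
    {
        foreach (var dayGroup in shifts.GroupBy(s => s.Start.DayOfWeek))
        {
            int weeks = dayGroup.Select(s => s.Start.Date).Distinct().Count();
            var groups = new List<List<Shift>>();

            foreach (var shift in dayGroup)
            {
                // Attach to an existing group if both endpoints are close to
                // the group's current average; otherwise start a new group.
                var match = groups.FirstOrDefault(g =>
                    Math.Abs((AvgStart(g) - shift.Start.TimeOfDay).TotalMinutes) <= ToleranceMinutes &&
                    Math.Abs((AvgEnd(g) - shift.End.TimeOfDay).TotalMinutes) <= ToleranceMinutes);
                if (match != null) match.Add(shift);
                else groups.Add(new List<Shift> { shift });
            }

            foreach (var g in groups.Where(g => (double)g.Count / weeks >= MinCoverage))
                Console.WriteLine($"{dayGroup.Key}s from {AvgStart(g):hh\\:mm} to {AvgEnd(g):hh\\:mm}");
        }
    }

    static TimeSpan AvgStart(List<Shift> g) =>
        TimeSpan.FromMinutes(g.Average(s => s.Start.TimeOfDay.TotalMinutes));
    static TimeSpan AvgEnd(List<Shift> g) =>
        TimeSpan.FromMinutes(g.Average(s => s.End.TimeOfDay.TotalMinutes));

    static void Main()
    {
        var shifts = new List<Shift>();
        for (int w = 0; w < 4; w++)   // four Mondays, 9:00 to 13:00
        {
            var monday = new DateTime(2011, 1, 3).AddDays(7 * w);
            shifts.Add(new Shift(monday.AddHours(9), monday.AddHours(13)));
        }
        Summarize(shifts);            // -> "Mondays from 09:00 to 13:00"
    }
}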
Could we see an example data set? If it is really "clean" data then you could simply find the mode of the start and end times.
One option would be to label all the start times as +1 and the end times as -1, then create a three-column table of times (both starts and ends), labels (+1 or -1), and the number of staff at each time (starting at zero and adding or subtracting staff using the label), and sort the whole thing in time order.
This time series is now a summary descriptor of your staff levels, and the labels form a series as well. Now you can apply time series statistics to both to look for daily, weekly or monthly patterns.
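A sketch of building that series (names are illustrative): each shift yields a +1 event at its start and a -1 at its end; sorting by time and accumulating the labels gives the staff level at every change point.

using System;
using System.Collections.Generic;
using System.Linq;

static class StaffLevels
{
    public static void Build(IEnumerable<(DateTime Start, DateTime End)> shifts)
    {
        var events = shifts.SelectMany(s => new[]
                           {
                               (Time: s.Start, Label: +1),
                               (Time: s.End,   Label: -1),
                           })
                           .OrderBy(e => e.Time)
                           .ThenBy(e => e.Label); // ends sort before starts at the same instant

        int staff = 0;
        foreach (var e in events)
        {
            staff += e.Label;                     // running staff level
            Console.WriteLine($"{e.Time:yyyy-MM-dd HH:mm} {e.Label:+0;-0} staff={staff}");
        }
    }

    static void Main()
    {
        var day = new DateTime(2011, 1, 3);
        Build(new[]
        {
            (day.AddHours(9),  day.AddHours(13)), // 9:00-13:00 shift
            (day.AddHours(10), day.AddHours(15)), // overlapping 10:00-15:00 shift
        });
    }
}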

Creditcard verification with regex?

What is the right way to verify a credit card number with a regex? If a regex is the way to go, which one should I use? There are tons online. If not, how should I verify it?
See this link: Finding or Verifying Credit Card Numbers with Regular Expressions
Visa: ^4[0-9]{12}(?:[0-9]{3})?$ All Visa card numbers start with a 4. New cards have 16 digits. Old cards have 13.
MasterCard: ^5[1-5][0-9]{14}$ All MasterCard numbers start with the numbers 51 through 55. All have 16 digits.
American Express: ^3[47][0-9]{13}$ American Express card numbers start with 34 or 37 and have 15 digits.
Diners Club: ^3(?:0[0-5]|[68][0-9])[0-9]{11}$ Diners Club card numbers begin with 300 through 305, 36 or 38. All have 14 digits. There are Diners Club cards that begin with 5 and have 16 digits. These are a joint venture between Diners Club and MasterCard, and should be processed like a MasterCard.
Discover: ^6(?:011|5[0-9]{2})[0-9]{12}$ Discover card numbers begin with 6011 or 65. All have 16 digits.
JCB: ^(?:2131|1800|35\d{3})\d{11}$ JCB cards beginning with 2131 or 1800 have 15 digits. JCB cards beginning with 35 have 16 digits.
Bye.
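For completeness, a minimal C# sketch of applying those patterns (the dictionary, the card-type labels and the space/dash stripping are additions, not part of the answer above):

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CardPatterns
{
    static readonly Dictionary<string, string> Patterns = new()
    {
        ["Visa"]             = @"^4[0-9]{12}(?:[0-9]{3})?$",
        ["MasterCard"]       = @"^5[1-5][0-9]{14}$",
        ["American Express"] = @"^3[47][0-9]{13}$",
        ["Diners Club"]      = @"^3(?:0[0-5]|[68][0-9])[0-9]{11}$",
        ["Discover"]         = @"^6(?:011|5[0-9]{2})[0-9]{12}$",
        ["JCB"]              = @"^(?:2131|1800|35\d{3})\d{11}$",
    };

    public static string Identify(string input)
    {
        string digits = Regex.Replace(input, @"[\s-]", ""); // drop spaces and dashes
        foreach (var (name, pattern) in Patterns)
            if (Regex.IsMatch(digits, pattern))
                return name;
        return "Unknown";
    }

    static void Main() =>
        Console.WriteLine(Identify("4111 1111 1111 1111")); // Visa
}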
How can I use credit card numbers containing spaces? covers everything you should need.
I think you're looking for the Luhn Algorithm. It's a simple checksum formula used to validate a variety of identification numbers.
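The Luhn check itself fits in a few lines; here's a standard implementation (the surrounding class and the handling of spaces/dashes are my additions):

using System;
using System.Linq;

static class Luhn
{
    public static bool IsValid(string cardNumber)
    {
        // Keep only the digits, since users often type spaces or dashes.
        string digits = new string(cardNumber.Where(char.IsDigit).ToArray());
        if (digits.Length == 0) return false;

        int sum = 0;
        bool doubleIt = false; // every second digit from the right is doubled
        for (int i = digits.Length - 1; i >= 0; i--)
        {
            int d = digits[i] - '0';
            if (doubleIt)
            {
                d *= 2;
                if (d > 9) d -= 9; // same as summing the two digits of d
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    static void Main()
    {
        Console.WriteLine(IsValid("4111 1111 1111 1111")); // True: common test number
        Console.WriteLine(IsValid("4111 1111 1111 1112")); // False: bad check digit
    }
}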
That depends on how accurate you want your pre-validation to be. To validate everything you can, you need to compute what the last digit of the card should be and compare to what is entered, which a RegEx cannot do.
For the algorithm and other details see this link, which also provides a list of common number prefixes that you could validate against.
-- Edit:
In fact, I'll slightly disagree with myself and agree with cletus. Validate as much as you can (without getting into the details of specific types of credit cards [IMHO]) before sending it on. And it goes without saying (hopefully) that this validation should be done in JavaScript, to make it fast, and then again on the server, to double-check (and for people with JavaScript disabled).
-- Previous Response:
Don't bother; just let the provider verify it when you actually attempt payment. There's no legitimate reason to try and verify it yourself. You can use this though, if you really feel like it.
