How to use the GloVe word embedding model in ML.NET - C#

I'm new to Machine Learning and working on my master's thesis using ML.NET. I'm trying to use the GloVe model to vectorize CV text, but I'm finding it hard to wrap my head around the process. I have the pipeline set up as below:
var pipeline = context.Transforms.Text.NormalizeText("Text", null,
        keepDiacritics: false, keepNumbers: false, keepPunctuations: false)
    .Append(context.Transforms.Text.TokenizeIntoWords("Tokens", "Text"))
    .Append(context.Transforms.Text.RemoveDefaultStopWords("WordsWithoutStopWords", "Tokens",
        Microsoft.ML.Transforms.Text.StopWordsRemovingEstimator.Language.English))
    .Append(context.Transforms.Text.ApplyWordEmbedding("Features", "WordsWithoutStopWords",
        Microsoft.ML.Transforms.Text.WordEmbeddingEstimator.PretrainedModelKind.GloVe300D));
var embeddingTransformer = pipeline.Fit(emptyData);
var predictionEngine = context.Model.CreatePredictionEngine<Input, Output>(embeddingTransformer);

var data = new Input { Text = TextExtractor.Extract("/attachments/CV6.docx") };
var prediction = predictionEngine.Predict(data);

Console.WriteLine($"Number of features: {prediction.Features.Length}");
Console.WriteLine("Features: ");
foreach (var feature in prediction.Features)
{
    Console.Write($"{feature} ");
}
Console.WriteLine(Environment.NewLine);
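For reference, Input and Output are simple classes along these lines, and emptyData is just an empty IDataView used to fit the text transforms (simplified):

public class Input
{
    public string Text { get; set; }
}

public class Output
{
    // Populated from the "Features" column produced by ApplyWordEmbedding.
    public float[] Features { get; set; }
}

// var emptyData = context.Data.LoadFromEnumerable(new List<Input>());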
From what I've studied about vectorization, each word in the document should be converted into a vector, but when I print the features, I see 900 features being printed. Can someone explain how this works? There are very few examples and tutorials about ML.NET available on the internet.

The vector of 900 features coming from the WordEmbeddingEstimator is the min/max/average of the individual word embeddings in your phrase. Each of the min, max and average is 300-dimensional for the GloVe 300D model, giving 900 in total.
The min/max gives the bounding hyper-rectangle for the words in your phrase. The average gives the standard phrase embedding.
See: https://github.com/dotnet/machinelearning/blob/d1bf42551f0f47b220102f02de6b6c702e90b2e1/src/Microsoft.ML.Transforms/Text/WordEmbeddingsExtractor.cs#L748-L752
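Conceptually, the transform does something like the following simplified sketch (not the actual ML.NET implementation; see the linked source for the exact ordering and edge cases):

// Requires: using System; using System.Collections.Generic; using System.Linq;
static float[] PhraseEmbedding(IReadOnlyList<float[]> wordVectors, int dim = 300)
{
    var min = new float[dim];
    var max = new float[dim];
    var avg = new float[dim];

    for (int d = 0; d < dim; d++)
    {
        min[d] = float.MaxValue;
        max[d] = float.MinValue;
        foreach (var w in wordVectors)
        {
            min[d] = Math.Min(min[d], w[d]);
            max[d] = Math.Max(max[d], w[d]);
            avg[d] += w[d];
        }
        avg[d] /= wordVectors.Count;
    }

    // Concatenate min + average + max: 300 + 300 + 300 = 900 features.
    return min.Concat(avg).Concat(max).ToArray();
}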

GloVe is short for Global Vectors (for Word Representation).
GloVe is an unsupervised (no human labeling of the training set) learning method. The vectors associated with each word are generally derived from each word's proximity to others in sentences.
Once you have trained your network (presumably on a much larger data set than a single CV/resume) then you can make interesting comparisons between words based on their absolute and relative "positions" in the vector space. A much less computationally expensive way of developing a network to analyze e.g. documents is to download a pre-trained dataset. I'm sure you've found this page (https://nlp.stanford.edu/projects/glove/) which, among other things, will allow you to access pre-trained word embeddings/vectorizations.
Final thoughts: I'm sorry if all of this is redundant information for you, especially if this really turns out to be an ML.NET framework syntax question. I don't know exactly what your goal is, but:
900 dimensions seems like an awful lot for looking at CVs. Maybe this is an ML.NET default? I suspect that 300-500 will be more than adequate. See what the pre-trained datasets provide.
If you only intend to train your network from zero on a single CV, then this method is going to be wholly inadequate.
Your best approach is likely a sort of transfer learning: obtain a liberally licensed network that has been pre-trained on a massive dataset in your language of interest (usually easy for academic work), then perform additional training using a smaller, targeted group of training-only CVs to add any specialized words to the 'vocabulary' of your network. Then perform your experimentation and analysis on a set of test CVs which have never been used to train the network.

Related

Fuzzy Entity Recognition

I am new to NLP. What I am trying to do (in C#) is, given a list of custom entities along the lines of
> NAME|ENTITY TYPE|ID
> Cubbies|Baseball Team|CHI
> Chicago Cubs|Baseball Team|CHI
> Dubs|Basketball Team|GSW
> Golden State Warriors|Basketball Team|GSW
I am looking to take short sentences and tag fuzzy matches of these entities.
For example, parse
Jordan Bell is going to make Golden St. much better next year
into
Jordan Bell is going to make [Basketball Team|GSW] much better next year
Ideally this would be in conjunction with generalized name recognition, e.g.:
[Person:Jordan Bell] is going to make [Basketball Team:GSW] much better [Time:next year]
Grateful for any help or direction. Thanks!
It's probably best to think of your problem in two parts: role labelling (Named Entity Recognition) and label unification (fuzzy matching).
For determining labels - that is, marking tokens in the sentences as team name, person, and so on - a Conditional Random Field (CRF) is a good model. CRF++ is a popular toolkit; The New York Times used CRF++ with some success on recipe data a few years back.
Since you're identifying the names of sports teams, you have two options for dealing with the fuzzy matching you described. You can do actual fuzzy matching using string similarity; this article explains how that was done in the Python library FuzzyWuzzy at a high enough level that it should be easy to re-implement.
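If you would rather roll your own in C#, a bare-bones Levenshtein-based ratio (in the spirit of FuzzyWuzzy's simple ratio, not the library itself) looks roughly like this:

// Similarity score from 0 (unrelated) to 100 (identical), based on edit distance.
// Token-based variants (compare word by word) handle nicknames like "Cubbies" better.
static int FuzzyRatio(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;

    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                               d[i - 1, j - 1] + cost);
        }

    int distance = d[a.Length, b.Length];
    int maxLen = Math.Max(a.Length, b.Length);
    return maxLen == 0 ? 100 : (int)(100.0 * (maxLen - distance) / maxLen);
}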
Your other option is Named Entity Resolution, which ties named entities (your labelled bits) to an external database. When you do this with Wikipedia it's called "Wikification", for example. This article describes someone using Wikipedia redirect information to recognize alternate names for companies - you could do the same thing by checking that Wikipedia redirects Cubbies to Chicago Cubs (it does).
Without knowing your data, it's hard to say whether fuzzy matching or Named Entity Resolution would be easier, so it's probably best to give them both a shot.
Sorry for not including resources explicitly for C# - that said, the techniques here are usually more important than the implementations.

What is the formula to calculate a QR Code's maximum data?

I've Googled and read quite a bit about QR codes and the maximum data that can be stored based on the various settings, all of it in tabular format. I can't seem to find anything giving a formula or a proper explanation of how these values are calculated.
What I would like to do is this:
Present the user with a form, allowing them to choose Format, EC & Version.
Then they can type in some data and generate a QR code.
Done deal. That part is easy.
The addition I would like to include is a "remaining character count" so that they (the user) can see how much more data they can type in, as well as what effect the properties have on the storage capacity of the QR code.
Does anyone know where I can find the formula(s)? Or do I need to purchase ISO 18004:2006?
A formula to calculate the amount of data you could put in a QR code would be quite complex to derive, not to mention it would need some approximations for the calculation to be possible. The formula would have to calculate the number of modules dedicated to data in your QR code based on its version, and then calculate how many codewords (which are sets of 8 modules) will be used for error correction.
To calculate the number of modules that will be used for data, you need to know how many modules will be used for the function patterns. While this is not a problem for the three finder patterns, the timing patterns or the version/format information, it is a problem for the alignment patterns, as their number depends on the QR code's version, meaning you would have to use a table at that point anyway.
For the second part, I have to say I don't know how to calculate the number of error-correcting codewords based on the correction capacity. For some reason, more error-correcting codewords are used than would be needed to match the stated error correction capacity; for example, a 6-H QR code can correct up to 32.6% of the data, instead of the 30% set by the H correction level.
In any case, as you can see, a formula would be quite complex to implement. Using a table, as already suggested, is probably the best thing you can do.
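To give an idea of what the table-driven approach looks like in code, here is a C# sketch with only the version 1 capacities filled in (values from the published capacity tables; extend it to versions 2-40 from the spec):

// Requires: using System.Collections.Generic;  (place inside a class)
enum EcLevel { L, M, Q, H }
enum Mode { Numeric, Alphanumeric, Byte }

// Maximum character counts per (version, error-correction level, mode).
static readonly Dictionary<(int Version, EcLevel Ec, Mode Mode), int> Capacity =
    new Dictionary<(int Version, EcLevel Ec, Mode Mode), int>
{
    { (1, EcLevel.L, Mode.Numeric), 41 }, { (1, EcLevel.L, Mode.Alphanumeric), 25 }, { (1, EcLevel.L, Mode.Byte), 17 },
    { (1, EcLevel.M, Mode.Numeric), 34 }, { (1, EcLevel.M, Mode.Alphanumeric), 20 }, { (1, EcLevel.M, Mode.Byte), 14 },
    { (1, EcLevel.Q, Mode.Numeric), 27 }, { (1, EcLevel.Q, Mode.Alphanumeric), 16 }, { (1, EcLevel.Q, Mode.Byte), 11 },
    { (1, EcLevel.H, Mode.Numeric), 17 }, { (1, EcLevel.H, Mode.Alphanumeric), 10 }, { (1, EcLevel.H, Mode.Byte), 7 },
};

static int RemainingCharacters(int version, EcLevel ec, Mode mode, string typedSoFar)
    => Capacity[(version, ec, mode)] - typedSoFar.Length;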
I wrote the original AIM specification for QR Code back in the '90s for Denso Corporation, and was also project editor for both editions of the ISO/IEC 18004 standard. It was felt to be much easier for people producing code printing software to use a look-up table rather than calculate capacities from a formula - no easy job as there are several independent variables that have to be taken into account iteratively when parsing the text to be encoded to minimise its length in bits, in order to achieve the smallest symbol. The most crucial factor is the mix of characters in the data, the sequence and lengths of sub-strings of numeric, alphanumeric, Kanji data, with the overhead needed to signal each change of character set, then the required level of error correction. I did produce a guidance section for this which is contained in the ISO standard.
The storage is calculated from the QR mode and the version/type that you are using. More specifically, the calculation is based on how 'compressible' the characters are and which algorithm the QR generator is allowed to use on the content.
More information can be found at http://en.wikipedia.org/wiki/QR_code#Storage

Speech to text in C#

I have a c# program that lets me use my microphone and when I speak, it does commands and will talk back. For example, when I say "What's the weather tomorrow?" It will reply with tomorrows weather.
The only problem is, I have to type out every phrase I want to say and have it pre-recorded. So if I want to ask for the weather, I HAVE to say it exactly like I coded it, with no variations. I am wondering if there is code to change this?
I want to be able to say "What's the weather for tomorrow", "what's tomorrow's weather" or "can you tell me tomorrow's weather" and have it tell me the next day's weather, but I don't want to have to type each phrase into code. I've seen something out there about e.Result.Alternates - is that what I need to use?
This cannot be done without involving linguistic resources. Let me explain what I mean by this.
As you may have noticed, your C# program only recognizes pre-recorded phrases, and only if you say the exact same words. (As an aside, this is quite an achievement in itself, because you can hardly say a sentence twice without altering it a bit. Small changes, e.g. in sound frequency or length, might not be relevant to your colleagues, but they matter to your program.)
Therefore, you need to incorporate some kind of linguistic resource into your program; in other words, make it "understand" facts about human language. Two suggestions with increasing complexity are below. All approaches assume that your tool is capable of tokenizing an audio input stream in a sensible way, i.e. extracting words from it.
Pattern matching
To avoid hard-coding the sentences like
Tell me about the weather.
What's the weather tomorrow?
Weather report!
you can instead define a pattern that matches any of those sentences:
if a sentence contains "weather", then output a weather report
This can be further refined in manifold ways, e.g.:
if a sentence contains "weather" and "tomorrow", output tomorrow's forecast.
if a sentence contains "weather" and "Bristol", output a forecast for Bristol
This kind of knowledge must be put into your program explicitly, for instance in the form of a dictionary or lookup table.
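A minimal C# sketch of such a lookup table (the keywords and handlers below are purely illustrative):

// Requires: using System; using System.Collections.Generic; using System.Linq;
static readonly List<(string[] Keywords, Action Handler)> Intents = new List<(string[] Keywords, Action Handler)>
{
    // More specific patterns go first.
    (new[] { "weather", "tomorrow" }, () => Console.WriteLine("Tomorrow's forecast...")),
    (new[] { "weather" },             () => Console.WriteLine("Current weather report...")),
};

static void Handle(string sentence)
{
    var words = sentence.ToLowerInvariant()
        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
        .Select(w => w.Trim('?', '!', '.', ','))
        .ToArray();

    foreach (var intent in Intents)
    {
        if (intent.Keywords.All(k => words.Contains(k)))
        {
            intent.Handler();              // fire the first (most specific) match
            return;
        }
    }
    Console.WriteLine("Sorry, I didn't understand that.");
}

// Handle("What's the weather tomorrow?");   -> "Tomorrow's forecast..."
// Handle("Tell me about the weather.");     -> "Current weather report..."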
Measuring Similarity
If you plan to spend more time on this, you could implement a means for finding the similarity between input sentences. There are many approaches to this as well, but a prominent one is a bag of words, represented as a vector.
In this model, each sentence is represented as a vector, with each word in it present as a dimension of the vector. For example, the sentence "I hate green apples" could be represented as
I = 1
hate = 1
green = 1
apples = 1
red = 0
you = 0
Note that the words that do not occur in this particular sentence, but in other phrases the program is likely to encounter, also represent dimensions (for example the red = 0).
The big advantage of this approach is that the similarity of vectors can be easily computed, no matter how multi-dimensional they are. There are several techniques that estimate similarity, one of them is cosine similarity (see for example http://en.wikipedia.org/wiki/Cosine_similarity).
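For illustration, a small C# sketch of bag-of-words vectors compared with cosine similarity (the vocabulary and sentences are made up):

// Requires: using System; using System.Linq;
static double[] ToBagOfWords(string sentence, string[] vocabulary)
{
    var words = sentence.ToLowerInvariant().Split(' ');
    return vocabulary.Select(v => words.Contains(v) ? 1.0 : 0.0).ToArray();
}

static double CosineSimilarity(double[] a, double[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// var vocab = new[] { "i", "hate", "green", "apples", "red", "you" };
// CosineSimilarity(ToBagOfWords("I hate green apples", vocab),
//                  ToBagOfWords("I hate red apples", vocab));   // = 0.75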
On a more general note, there are many other considerations to be made of course.
For example, some words might be utterly irrelevant to the message you want to convey, as in the following sentence:
I want you to output a weather report.
Here, at least "I", "you" "to" and "a" could be done away with without damaging the basic semantics of the sentence. Such words are called stop words and are discarded early in many tools that perform speech-to-text analysis.
Also note that we started out assuming that your program reliably identifies sound input. In reality, no tool is capable of infallibly identifying speech.
Humans tend to forget that sound actually exists without cues as to where word or sentence boundaries are. This makes so-called disambiguation of input a gargantuan task that is easily underestimated - and ambiguity one of the hardest problems of computational linguistics in general.
For that, the code won't be able to judge on its own! You need to split the command into an array of words, such as
Tomorrow
Weather
What
This way, you can compare it with the text that is present in your program! Let's say, the command (what), the type (weather) and the time (tomorrow).
It is better to read and understand each word than to guess - that is how Google works. Google does the same thing: it breaks the string down and compares the parts.

How to detect keyword stuffing?

We are working on a kind of document search engine - primarily focused on indexing user-submitted MS Word documents.
We have noticed that there is keyword-stuffing abuse.
We have determined two main kinds of abuse:
Repeating the same term, again and again
Many irrelevant terms added to the document en masse
These two forms of abuse are enabled by either adding text with the same font colour as the background colour of the document, or by setting the font size to something like 1px.
Determining whether the background colour is the same as the text colour is tricky, given the intricacies of MS Word layouts. The same goes for font size: any cut-off seems potentially arbitrary, and we may accidentally remove valid text if we set the cut-off too large.
My question is: are there any standardized pre-processing or statistical analysis techniques that could be used to reduce the impact of this kind of keyword stuffing?
Any guidance would be appreciated!
There's a surprisingly simple solution to your problem using the notion of compressibility.
If you convert your Word documents to text (you can easily do that on the fly), you can then compress them (for example, using the zlib library, which is free) and look at the compression ratios. Normal text documents usually have a compression ratio of around 2, so any significant deviation would mean that they have been "stuffed". The analysis is extremely easy; I have analyzed around 100k texts and it takes only around a minute using Python.
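The same check is straightforward in C# as well; here is a rough sketch using DeflateStream (any zlib/deflate implementation behaves similarly):

// Requires: using System.IO; using System.IO.Compression; using System.Text;
static double CompressionRatio(string text)
{
    byte[] raw = Encoding.UTF8.GetBytes(text);
    using (var output = new MemoryStream())
    {
        using (var deflate = new DeflateStream(output, CompressionMode.Compress, leaveOpen: true))
        {
            deflate.Write(raw, 0, raw.Length);
        }
        // Ordinary prose tends to land around 2:1; a much higher ratio
        // suggests heavily repeated text worth flagging for review.
        return (double)raw.Length / output.Length;
    }
}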
Another option is to look at the statistical properties of the documents/words. In order to do that, you need to have a sample of "clean" documents and calculate the mean frequency of the distinct words as well as their standard deviations.
After you have done that, you can take a new document and compare it against the mean and the deviation. Stuffed documents will show up either as documents with a few words that deviate very strongly from their mean frequency (one or two words repeated many times) or as documents with many words with high deviations (blocks of text repeated).
Here are some useful links about compressibility:
http://www.ra.ethz.ch/cdstore/www2006/devel-www2006.ecs.soton.ac.uk/programme/files/pdf/3052.pdf
http://www.ispras.ru/ru/proceedings/docs/2011/21/isp_21_2011_277.pdf
You could also probably use the concept of entropy, for example the Shannon entropy calculation: http://code.activestate.com/recipes/577476-shannon-entropy-calculation/
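For example, a word-level Shannon entropy is only a few lines of C# (sketch):

// Entropy (in bits) of the word distribution; documents stuffed with a few
// repeated terms have noticeably lower entropy than ordinary prose.
// Requires: using System; using System.Collections.Generic; using System.Linq;
static double ShannonEntropy(IEnumerable<string> words)
{
    var counts = words.GroupBy(w => w).Select(g => (double)g.Count()).ToArray();
    double total = counts.Sum();
    return -counts.Sum(c => (c / total) * Math.Log(c / total, 2));
}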
Another possible solution would be to use part-of-speech (POS) tagging. I reckon that the average percentage of nouns is similar across "normal" documents (37% according to http://www.ingentaconnect.com/content/jbp/ijcl/2007/00000012/00000001/art00004?crawler=true). If the percentage were much higher or lower for some POS tags, then you could possibly detect "stuffed" documents.
As Chris Sinclair commented on your question, unless you have Google-level algorithms (and even they get it wrong, and thereby have an appeal process) it's best to flag likely keyword-stuffed documents for further human review...
If a page has 100 words and you search through the page counting the occurrences of keywords (rendering stuffing by 1px or background-coloured text irrelevant), you get a keyword density count. There is no hard and fast rule for a certain percentage 'always' being keyword stuffing; generally 3-7% is normal. Perhaps if you detect 10%+ you flag it as 'potentially stuffed' and set it aside for human review.
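As a rough sketch, that density check is only a few lines of C# (the 10% threshold is just the arbitrary cut-off mentioned above):

// Requires: using System; using System.Linq;
static bool LooksStuffed(string text, double threshold = 0.10)
{
    var words = text.ToLowerInvariant()
        .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
    if (words.Length == 0) return false;

    // Density of the single most frequent term.
    int topCount = words.GroupBy(w => w).Max(g => g.Count());
    return (double)topCount / words.Length >= threshold;
}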
Furthermore, consider these scenarios (taken from here):
Lists of phone numbers without substantial added value
Blocks of text listing cities and states a webpage is trying to rank for
and what the context of a keyword is.
Pretty damn difficult to do correctly.
Detect tag-abuse with forecolor/backcolor detection like you already do.
For size detection calculate the average text size and remove the outliers.
Also set predefined limits on the textsize (like you already do).
Next up is the structure of the tag "blobs".
For your first point you can just count the words, and if one occurs too often (maybe 5x more often than the second most common word) you can flag it as a repeated tag.
When adding tags en masse, the user often adds them all in one place, so you can check whether known "fraud tags" appear next to each other (maybe with one or two words in between).
If you could identify at least some common "fraud tags" and want to get a bit more advanced then you could do the following:
Split the document into parts with the same textsize / font and analyze each part separately. For better results group parts that use nearly the same font/size, not only those that have EXACTLY the same font/size.
Count the occurrences of each known tag, and when some limit set by you is exceeded, that part of the document is removed or the document is flagged as "bad" (as in "uses excessive tags").
No matter how advanced your detection is, as soon as people know it's there and more or less know how it works, they will find ways to circumvent it.
When that happens you should just flag the offending documents and look through them yourself. Then, if you notice that your detection algorithm produced a false positive, you improve it.
If you notice a pattern, such as the common stuffers always using a font size below a certain value, e.g. 1-5px, which is not really readable, then you can assume that that is the "stuffed" part.
You can then go on to check whether the font colour is also the same as the background colour and remove that section.

Correcting regex patterns across languages

I found this regex pattern at http://gskinner.com/RegExr/
,(?=(?:[^"]*"[^"]*")*(?![^"]*"))
It is for pattern matching CSV-delimited values (more specifically, the separating commas, which can be split on), and on that site it works excellently with my test data. You can see what I think is the JavaScript implementation in the bottom panel of the linked site when tested.
However when I attempt to implement this in C# / .net, the matching doesn't quite work properly.
My implementation:
Regex r = new Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))", RegexOptions.ECMAScript);
// get data...
foreach (string match in r.Split(sr.ReadLine()))
{
    //lblDev.Text = lblDev.Text + match + "<br><br><br><p>column:</p><br>";
    dtF.Columns.Add(match);
}
// more of the same to get rows
On some data rows the result exactly matches the result generated on the site above, but on others the first 6 or so rows fail to split or simply are not present in the split array.
Can anyone advise me on why the pattern does not appear to be behaving in the same way?
my test data:
CategoryName,SubCategoryName,SupplierName,SupplierCode,ProductTitle,Product Company ,ProductCode,Product_Index,ProductDescription,Product BestSeller,ProductDimensions,ProductExpressDays,ProductBrandName,ProductAdditionalText ,ProductPrintArea,ProductPictureRef,ProductThumnailRef,ProductQuantityBreak1 (QB1),ProductQuantityBreak2 (QB2),ProductQuantityBreak3 (QB3),ProductQuantityBreak4 (QB4),ProductPlainPrice1,ProductPlainPrice2,ProductPlainPrice3,ProductPlainPrice4,ProductColourPrice1,ProductColourPrice2,ProductColourPrice3,ProductColourPrice4,ProductExtraColour1,ProductExtraColour2,ProductExtraColour3,ProductExtraColour4,SellingPrice1,SellingPrice2,SellingPrice3,SellingPrice4,ProductCarriageCost1,ProductCarriageCost2,ProductCarriageCost3,ProductCarriageCost4,BLACK,BLUE,WHITE,SILVER,GOLD,RED,YELLOW,GREEN,ProductOtherColors,ProductOrigination,ProductOrganizationCost,ProductCatalogEntry,ProductPageNumber,ProductPersonalisationType1 (PM1),ProductPrintPosition,ProductCartonQuantity,ProductCartonWeight,ProductPricingExpering,NewProduct,ProductSpecialOffer,ProductSpecialOfferEnd,ProductIsActive,ProductRepeatOrigination,ProductCartonDimession,ProductSpecialOffer1,ProductIsExpress,ProductIsEco,ProductIsBiodegradable,ProductIsRecycled,ProductIsSustainable,ProductIsNatural
Audio,Speakers and Headphones,The Prime Time Company,CM5064:In-ear headphones,Silly Buds,,10058,372,"Small, trendy ear buds with excellent sound quality and printing area actually on each ear- piece. Plastic storage box, with room for cables be wrapped around can also be printed.",FALSE,70 x 70 x 20mm,,,,10mm dia,10058.jpg,10058.jpg,100,250,500,1000,2.19,2.13,2.06,1.99,0.1,0.1,0.05,0.05,0.1,0.1,0.05,0.05,3.81,3.71,3.42,3.17,0,0,0,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,,30,,TRUE,24,Screen Printed,Earpiece,200,11,,TRUE,,,TRUE,15,,,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
Audio,Speakers and Headphones,The Prime Time Company,CM5058:Headstart,Head Start,,10060,372,"Lightweight, slimline, foldable and patented headphones ideal for the gym or exercise. These
headphones uniquely hang from the ears giving security, comfort and an excellent sound quality. There is also a secret cable winding facility.",FALSE,130 x 85 x 45mm,,,,30mm dia,10060.jpg,10060.jpg,100,250,500,1000,5.6,5.43,5.26,5.09,0.1,0.1,0.05,0.05,0.1,0.1,0.05,0.05,9.47,8.96,8.24,7.97,0,0,0,0,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,,30,,TRUE,24,Screen Printed,print plate on ear (s),100,11,,TRUE,,,TRUE,15,,,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
Use the right tool for the job. Regex is not well suited to parsing CSV, which can have unlimited numbers of nested quotes.
Use this instead:
A Fast CSV Reader
http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader
We use it in production code. It works great and makes you appreciate how complex parsing can be. For even more information on the complexity, check out the over 800 unit tests included in the solution.
Your C# regex works fine for me in LINQPad, but your data does include a newline within the last "row" of data, so you can't simply use sr.ReadLine() to read the data.
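One way around that is to use a CSV reader that understands quoted fields spanning multiple lines, for example the built-in TextFieldParser; a sketch (the file path is illustrative):

// Requires a reference to Microsoft.VisualBasic and: using Microsoft.VisualBasic.FileIO;
using (var parser = new TextFieldParser("data.csv"))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    parser.HasFieldsEnclosedInQuotes = true;

    string[] header = parser.ReadFields();    // column names
    while (!parser.EndOfData)
    {
        string[] row = parser.ReadFields();   // one record, even if it spans lines
        // add the row to dtF here
    }
}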
