FastText Tutorial – In this tutorial, we shall learn how to train word representations (word vectors) in FastText using unsupervised learning. The tutorial uses a small text file to show the command flow for CBOW, SkipGram, word vectors, and sentence vectors.

How FastText Learns Word Representations

Word representations convert words into dense numerical vectors that machine learning models can work with. FastText learns these vectors from raw text and also uses character n-grams, which helps it build useful representations for rare words and word forms that share subword patterns.
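To make the idea of character n-grams concrete, here is a minimal sketch that lists the n-grams of a word after wrapping it in the boundary markers < and >. The function name char_ngrams is hypothetical, and the default lengths 3 to 6 are an assumption chosen to match the -minn 3 -maxn 6 values used later in this tutorial. The real library also hashes n-grams into buckets and keeps a separate vector for the whole word, which this sketch leaves out.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word` wrapped in < and > markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(minn, maxn + 1):
        # Slide a window of length n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Words such as "where" and "somewhere" share several of these n-grams, which is why related word forms can end up with related vectors.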

Unsupervised FastText training does not need labels. It only needs a plain text corpus that is similar to the text you want to process later. For example, train on product reviews for review analysis, support tickets for support automation, and news articles for news-related NLP tasks.

Input Data for FastText Word Representation Training

Unlike supervised learning, unsupervised learning does not require labelled data, so any large plain-text corpus can be used as input to train the model. For example, if you want to train on a large corpus, Wikipedia dumps are available at https://dumps.wikimedia.org/enwiki/latest/ (these dumps are XML and need to be converted to plain text first).

For this tutorial, we shall use the following sample data:

Text Classification is one of the important NLP (Natural Language Processing) task with wide range of application in solving problems like Document Classification, Sentiment Analysis, Email SPAM Classification, Tweet Classification etc.

FastText provides “supervised” module to build a model for Text Classification using Supervised learning.

To work with fastText, it has to be built from source. To build fastText, follow the fastText Tutorial – How to build FastText library from github source. Once fastText is built, run the fasttext commands mentioned in the following tutorial from the location of fasttest executable.

Save this text in a file named wordRepTrainingData.txt. This sample is only for learning the commands. For a useful model, use a larger, cleaned, domain-specific corpus and keep casing, punctuation, and tokenization consistent.

CBOW and SkipGram Techniques in FastText

FastText provides two techniques for training word vectors:

  • Continuous Bag Of Words (CBOW)
  • SkipGram

CBOW learns by using surrounding context to predict a word. SkipGram learns by using a word to predict surrounding context. CBOW is a good first choice for a fast baseline, while SkipGram is worth trying when less frequent words are important. In real projects, train both and evaluate the vectors on your downstream NLP task.
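The difference between the two techniques can be sketched by listing the training examples each one would build from the same sentence. This is a simplified illustration, not the FastText trainer: it assumes a fixed context window and whitespace tokens, and the function names are hypothetical. The real trainer also varies the window size and works with subword vectors.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, target_word): the context predicts the target."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            yield context, target


def skipgram_examples(tokens, window=2):
    """Yield (target_word, context_word): the target predicts each neighbour."""
    for context, target in cbow_examples(tokens, window):
        for neighbour in context:
            yield target, neighbour


tokens = "fasttext learns word vectors".split()
for context, target in cbow_examples(tokens, window=1):
    print("CBOW    ", context, "->", target)
for target, neighbour in skipgram_examples(tokens, window=1):
    print("SkipGram", target, "->", neighbour)
```

CBOW produces one example per position, while SkipGram produces one example per (word, neighbour) pair, which is one reason SkipGram training tends to take longer on the same corpus.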

Training Continuous Bag Of Words (CBOW) Model in FastText

The following is the syntax to train word vectors using the CBOW model.

$ ./fasttext cbow -input <input_file> -output <output_file>

CBOW Example for wordRepTrainingData.txt

We shall use the wordRepTrainingData.txt file created in the input data section as training data.

$ ./fasttext cbow -input wordRepTrainingData.txt -output cbowModel
Read 0M words
Number of words:  2
Number of labels: 0
Progress: 100.0%  words/sec/thread: 33  lr: 0.000000  loss: 0.000000  eta: 0h0m 

cbowModel.bin is created after training.

If you are testing with a very small demo file, you may need -minCount 1 so that words appearing once are not filtered out.

./fasttext cbow -input wordRepTrainingData.txt -output cbowModelDemo -minCount 1
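The effect of -minCount can be sketched with a simple frequency count. The snippet below is an illustration only: it assumes whitespace tokenization, the function name vocabulary is hypothetical, and the default threshold of 5 is an assumption about the default for unsupervised training that you should confirm against the help output of your own build.

```python
from collections import Counter


def vocabulary(text, min_count=5):
    """Count whitespace-separated tokens and keep those seen at least min_count times."""
    counts = Counter(text.split())
    return {word: n for word, n in counts.items() if n >= min_count}


sample = "text classification with fasttext fasttext text text"
print(vocabulary(sample, min_count=1))  # every word is kept
print(vocabulary(sample, min_count=2))  # words seen only once are dropped
```

Running a count like this on your corpus before training gives a quick idea of how many words will survive a given threshold.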

Training SkipGram Model in FastText

The following is the syntax to train word vectors using the SkipGram model.

$ ./fasttext skipgram -input <input_file> -output <output_file>

SkipGram Example for wordRepTrainingData.txt

We shall use the same wordRepTrainingData.txt file as training data.

Use a SkipGram-specific output prefix such as skipGramModel, so that the CBOW model files created earlier are not overwritten.

$ ./fasttext skipgram -input wordRepTrainingData.txt -output skipGramModel
Read 0M words
Number of words: 2
Number of labels: 0
Progress: 100.0% words/sec/thread: 27 lr: 0.000000 loss: 0.000000 eta: 0h0m 

skipGramModel.bin is created after training.

FastText Model Files Created After Training

FastText uses the value passed to -output as a file prefix. For example, -output cbowModel creates cbowModel.bin and cbowModel.vec. The .bin file is the binary model file used by FastText commands. The .vec file is a plain text file with one word vector per line, which is useful in tools that expect text vectors.
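As a sketch of how another tool might read the .vec file, the following parser assumes the usual text layout: a header line with the number of words and the vector dimension, followed by one line per word containing the word and its values. The function name read_vec is hypothetical and the demo values are made up purely to show the layout.

```python
import io


def read_vec(lines):
    """Parse .vec text: a 'count dim' header, then 'word v1 v2 ...' per line."""
    lines = iter(lines)
    count, dim = (int(x) for x in next(lines).split())
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return count, dim, vectors


# Made-up three-dimensional vectors, only to show the layout.
demo = io.StringIO("2 3\ntext 0.1 0.2 0.3 \nclassification -0.1 0.0 0.5 \n")
count, dim, vectors = read_vec(demo)
print(count, dim, vectors["text"])
```

For a real model, the same function can be given an open file handle, for example the result of open("cbowModel.vec", encoding="utf-8").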

Useful FastText Parameters for Better Word Vectors

For real training, adjust the command instead of relying only on defaults. Common options include -dim for vector size, -epoch for training passes, -minCount for vocabulary filtering, -minn and -maxn for character n-gram lengths, and -thread for parallel training.

./fasttext cbow -input corpus.txt -output cbowModel -dim 100 -epoch 5 -minCount 5 -minn 3 -maxn 6 -thread 4
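When comparing several parameter settings, it can help to build the command line from a script instead of typing it each time. The helper below is a hypothetical sketch: it only assembles the argument list for the options named in this tutorial and does not check that an option is valid, so a misspelled option is only caught when the command is actually run.

```python
def fasttext_command(mode, input_path, output_prefix, **options):
    """Build an argument list such as ['./fasttext', 'cbow', '-input', ...]."""
    if mode not in ("cbow", "skipgram"):
        raise ValueError("mode must be 'cbow' or 'skipgram'")
    command = ["./fasttext", mode, "-input", input_path, "-output", output_prefix]
    for name, value in options.items():
        # Each keyword becomes a '-name value' pair, in the order given.
        command += [f"-{name}", str(value)]
    return command


for mode in ("cbow", "skipgram"):
    cmd = fasttext_command(mode, "corpus.txt", f"{mode}Model",
                           dim=100, epoch=5, minCount=5)
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the directory that contains the fasttext executable.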

Print Word Vectors from a FastText Model

Once the model is generated, we can print the word vectors for any input words.

Example: Calculate the word vector for the word “Classification”

$ echo "Classification" | ./fasttext print-word-vectors cbowModel.bin 
Classification -0.0016351 -0.00038951 -0.00069403 -0.00055687 4.6813e-05 0.00084484 -0.00032377 -0.0014186 -0.00010761 0.00096472 0.00041914 0.0018084 -0.00021441 0.0016066 -0.00025791 -0.00013698 0.0015549 0.00080067 -0.0011226 -0.0001057 0.00077716 3.0814e-05 -0.0008903 0.00051218 0.0010777 -0.00021787 0.0004454 -9.1978e-05 0.0013804 -0.00065836 -0.00012421 0.00090651 -0.00076955 0.00015702 -6.6829e-05 0.00037686 -0.00082451 -0.00089599 -4.8236e-05 0.0011861 -0.00053301 0.0013759 -0.00050949 -0.00052694 -0.00025271 0.00018434 0.00069015 0.00022772 -0.0006613 -0.00024038 0.00082301 -0.001342 -0.00023147 4.6686e-05 -0.0021591 -0.0012267 0.00016453 -7.0963e-05 0.00012941 -0.00033523 -0.00025687 -0.0016622 0.0011311 0.00031574 0.00051476 0.00021078 -0.0010296 -0.00077612 -0.0002647 0.00040547 0.00022524 7.8208e-06 -0.0012234 -0.0012435 0.00084114 -0.0021134 -0.00032346 -0.00037915 -0.0011645 -0.00055294 0.000298 0.00022919 -0.00040574 0.0010034 0.00027639 0.00071129 -0.00096475 -0.00088694 -0.00020765 0.00017506 -0.00074152 -0.00063677 -0.0018727 -0.00081131 -0.00027694 0.00061828 -0.00024931 -0.0011524 0.00021265 -0.00024279

The first item in the output is the input word. The remaining numbers are the vector values for that word.

printf "Text\nClassification\nFastText\n" | ./fasttext print-word-vectors cbowModel.bin

Print Sentence Vectors from a FastText Model

We can also calculate sentence vectors using the CBOW and SkipGram models that we generated.

Example: Calculate the sentence vector for the sentence “Text Classification”

$ echo "Text Classification" | ./fasttext print-sentence-vectors cbowModel.bin 
Text Classification -0.10849 0.0073465 0.010102 -0.063361 -0.059639 0.056901 -0.06169 -0.04626 0.015623 0.079396 0.063662 0.13331 -0.10584 0.1265 -0.070325 -0.094202 0.082804 0.066358 -0.033852 0.039573 0.0044317 -0.042774 -0.14243 0.010955 0.053763 0.011553 0.072239 -0.10154 0.007844 -0.028087 -0.057292 0.016036 -0.11378 0.026555 -0.043418 -0.00021922 0.053161 -0.024643 0.044737 0.11826 -0.086438 0.062033 0.0086412 -0.064439 0.044403 -0.030381 0.073831 0.0065884 -0.14511 0.049224 0.1389 -0.0043203 0.05156 0.028902 -0.15638 -0.11769 0.01515 0.050197 0.025984 -0.030021 -0.028685 -0.12303 0.0008013 0.084163 0.025181 0.016443 -0.08329 -0.0037237 -0.016232 0.044954 -0.0032083 0.008169 -0.10068 -0.12146 -0.013546 -0.27842 -0.042486 -0.088876 -0.084226 -0.0492 0.096401 0.01784 -0.028391 0.019633 0.09417 0.10986 -0.055056 -0.051792 -0.11848 0.025789 -0.013399 -0.12246 -0.11678 -0.018821 0.07682 0.007471 0.015359 -0.003884 -0.02354 -0.0035358 

We have printed word and sentence vectors using the CBOW model. As practice, you may try the same with the SkipGram model. All you need to do is provide skipGramModel.bin instead of cbowModel.bin in the commands.
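A sentence vector can be thought of as a combination of the word vectors in the sentence. The sketch below assumes the averaging scheme described for unsupervised models in the FastText documentation: each word vector is divided by its L2 norm, and the normalized vectors are then averaged. The function name is hypothetical, and details such as end-of-sentence handling may differ between FastText versions.

```python
import math


def sentence_vector(word_vectors):
    """Average the L2-normalised word vectors, skipping all-zero vectors."""
    if not word_vectors:
        return []
    dim = len(word_vectors[0])
    total = [0.0] * dim
    used = 0
    for vec in word_vectors:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            for i, x in enumerate(vec):
                total[i] += x / norm
            used += 1
    # Divide by the number of vectors that actually contributed.
    return [x / used for x in total] if used else total


print(sentence_vector([[3.0, 4.0], [1.0, 0.0]]))
```

Under this assumption, the normalization step would also explain why the sentence vector values printed above are much larger in magnitude than the individual word vector values.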

FastText Word Representation FAQ

How do I make a model learn word representations in FastText?

Create a plain text corpus and run ./fasttext cbow or ./fasttext skipgram with -input and -output. FastText reads the corpus and creates model files for word vectors.

Why does FastText show only a few words in the training output?

The corpus may be too small, or -minCount may be filtering out low-frequency words. For a small demo file, use -minCount 1. For a real model, use more training text.

Can FastText generate vectors for unseen words?

FastText can often build a vector for an unseen word from character n-grams, as long as subword information is enabled in the trained model.
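The following sketch illustrates the idea with a toy lookup table. It assumes the vector of an unseen word is the average of the vectors of its character n-grams. The table ngram_vectors and its values are hypothetical. In the real library, n-grams are hashed into a fixed number of buckets instead of being looked up by string.

```python
def oov_vector(word, ngram_vectors, minn=3, maxn=6):
    """Average the vectors of the word's character n-grams found in the table."""
    wrapped = f"<{word}>"
    found = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            vec = ngram_vectors.get(wrapped[start:start + n])
            if vec is not None:
                found.append(vec)
    if not found:
        return None  # no known n-gram, so no vector can be built
    dim = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]


# Hypothetical n-gram vectors that a trained model might hold.
ngram_vectors = {"<cl": [3.0, 0.0], "ass": [0.0, 3.0], "fy>": [3.0, 3.0]}
print(oov_vector("classify", ngram_vectors, minn=3, maxn=3))
# [2.0, 2.0]
```

Here "classify" was never seen as a whole word, but three of its n-grams are known, so a vector can still be produced.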

Which file should I use after FastText training?

Use the .bin file with FastText commands such as print-word-vectors. Use the .vec file when another tool needs plain text vectors.

Conclusion

In this FastText Tutorial, we have learnt how to train word representations in FastText using two unsupervised learning techniques: CBOW (Continuous Bag of Words) and SkipGram. We also calculated word vectors and sentence vectors from the trained model.