In this tutorial, we learn how to train word representations with FastText in Python using unsupervised learning. We cover the original CBOW and SkipGram examples, the current Python API pattern, input text preparation, model loading, word vector lookup, nearest-neighbour queries, and practical checks before using the vectors in an NLP project.
FastText Python Word Representations: What You Will Build
FastText represents words as dense numeric vectors. Unlike basic word-vector methods that treat every word as an indivisible unit, FastText can use subword information (character n-grams), so it can produce useful representations for rare words, misspellings, and related word forms when the training data supports them.
Machine learning models work with numbers, so words and sentences are represented numerically as word vectors. FastText provides tools to learn these word representations, which can improve accuracy in tasks such as text classification. In this tutorial, you will:
- Install the FastText Python package.
- Prepare a plain text training corpus.
- Train unsupervised word vectors with CBOW or SkipGram.
- Save, load, and query the trained FastText model.
- Use the learned vectors in downstream NLP tasks.
Learn Word Representations in FastText
The basic workflow is simple: keep your training text in a plain text file, train a FastText model using an unsupervised method, then use the trained model to retrieve word vectors or related words. The quality of the vectors depends heavily on the amount, cleanliness, and relevance of the text corpus.
1. Install FastText in Python for word-vector training
Cython is a prerequisite for installing fasttext. To install Cython, run the following command in a terminal:
$ pip install Cython --install-option="--no-cython-compile"
To use fasttext in a Python program, install it with the following command:
$ pip install fasttext
root@arjun-VPCEH26EN:~# pip install fasttext
Collecting fasttext
Using cached fasttext-0.8.3.tar.gz
Collecting numpy>=1 (from fasttext)
Downloading numpy-1.13.1-cp27-cp27mu-manylinux1_x86_64.whl (16.6MB)
100% |████████████████████████████████| 16.6MB 48kB/s
Collecting future (from fasttext)
Downloading future-0.16.0.tar.gz (824kB)
100% |████████████████████████████████| 829kB 228kB/s
Building wheels for collected packages: fasttext, future
Running setup.py bdist_wheel for fasttext ... done
Stored in directory: /root/.cache/pip/wheels/55/0a/95/e23f773666d3487ee7456b220f7e8d37e99b74833b20dd06a0
Running setup.py bdist_wheel for future ... done
Stored in directory: /root/.cache/pip/wheels/c2/50/7c/0d83b4baac4f63ff7a765bd16390d2ab43c93587fac9d6017a
Successfully built fasttext future
Installing collected packages: numpy, future, fasttext
Successfully installed fasttext-0.8.3 future-0.16.0 numpy-1.13.1
root@arjun-VPCEH26EN:~#
FastText is successfully installed in Python.
For current Python environments, it is usually better to install FastText inside a virtual environment and use python -m pip so that the package is installed for the same Python interpreter that runs your program.
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install fasttext
On Windows PowerShell, activate the virtual environment with the following command before installing FastText.
.venv\Scripts\Activate.ps1
python -m pip install fasttext
2. Prepare input text data for FastText unsupervised training
Keep in mind that a useful model needs a large corpus that matches your use case, ideally on the order of a billion words. The input is supplied as a plain text file.
For a small practice run, you can start with a few lines of text. For a production-quality word representation model, use a large domain-specific corpus. Each line may contain a sentence or a paragraph. Keep the file in UTF-8 text format.
machine learning models learn from data
fasttext learns word vectors using subword information
word embeddings represent words as numeric vectors
text classification can use word vectors as features
Save the input as TrainingData.txt or any filename of your choice. The examples below assume that the training file is available in the same directory as the Python script.
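The file preparation step can be sketched as follows. This is a minimal example, assuming the four practice sentences above and the filename TrainingData.txt; the helper name write_corpus and the statistics it returns are illustrative and not part of FastText.

```python
from pathlib import Path

# Practice sentences from this tutorial; replace them with your own corpus.
SAMPLE_LINES = [
    "machine learning models learn from data",
    "fasttext learns word vectors using subword information",
    "word embeddings represent words as numeric vectors",
    "text classification can use word vectors as features",
]


def write_corpus(lines, path):
    """Write one sentence per line as UTF-8 and return simple corpus stats."""
    path = Path(path)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    tokens = [tok for line in lines for tok in line.split()]
    return {"lines": len(lines), "tokens": len(tokens), "unique": len(set(tokens))}


stats = write_corpus(SAMPLE_LINES, "TrainingData.txt")
print(stats)
```

Printing the line, token, and unique-token counts is a quick sanity check that the file contains what you expect before you spend time on training.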
3. Choose CBOW or SkipGram for FastText word representations
To train word vectors, FastText provides two techniques:
- Continuous Bag Of Words (CBOW)
- SkipGram
CBOW learns by predicting a word from its surrounding context. SkipGram learns by predicting surrounding context words from a target word. In practice, CBOW is often a good starting point for faster training, while SkipGram is commonly tried when rare-word behaviour is important. Test both on your own dataset if the final application is sensitive to embedding quality.
| FastText model type | How it learns | When to try it |
|---|---|---|
| CBOW | Predicts the current word from nearby context words. | Good first choice when training speed matters. |
| SkipGram | Predicts nearby context words from the current word. | Useful to compare when rare words and richer word relationships matter. |
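To make the difference in the table concrete, the sketch below lists the training examples each method would see for one short sentence. It is a simplified illustration, not FastText's implementation: the function names are hypothetical, the window is fixed, and details such as random window sizes and subsampling of frequent words are left out.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, target) pairs: the context predicts the target."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target


def skipgram_examples(tokens, window=2):
    """Yield (target, context_word) pairs: the target predicts each neighbour."""
    for context, target in cbow_examples(tokens, window):
        for word in context:
            yield target, word


tokens = "fasttext learns word vectors".split()
cbow = list(cbow_examples(tokens, window=1))
skip = list(skipgram_examples(tokens, window=1))
print(cbow)
print(skip)
```

CBOW produces one example per word position, while SkipGram produces one example per (word, neighbour) pair, which is one reason SkipGram training usually takes longer on the same corpus.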
4. Train a CBOW model with FastText Python
The following example builds a CBOW model.
import fasttext
# CBOW model (legacy fasttext 0.8.x API, Python 2 print syntax)
model = fasttext.cbow('TrainingData.txt', 'model')
print model.words # list of words in dictionary
print model['machine'] # get the vector of the word 'machine'
Running the above Python program creates two files: a model file (with .bin extension) containing the trained parameters, and a vector file (with .vec extension) containing the vector representations of the words in the training data file.
The example above uses the older FastText Python API. In current versions of the official Python module, use train_unsupervised() with model='cbow'.
import fasttext
model = fasttext.train_unsupervised(
    input='TrainingData.txt',
    model='cbow',
    dim=100,
    epoch=5,
    lr=0.05
)
model.save_model('cbowModel.bin')
print(model.get_words()[:10])
print(model.get_word_vector('machine'))
The .bin file stores the trained FastText model. You can load this file later to generate vectors, query nearest neighbours, or continue using the model in another Python program.
5. Train a SkipGram model with FastText Python
The following example builds a SkipGram model.
import fasttext
# SkipGram model (legacy fasttext 0.8.x API, Python 2 print syntax)
model = fasttext.skipgram('TrainingData.txt', 'model')
print model.words # list of words in dictionary
print model['machine'] # get the vector of the word 'machine'
Running the above Python program creates two files: a model file (with .bin extension) containing the trained parameters, and a vector file (with .vec extension) containing the vector representations of the words in the training data file.
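The .vec file is plain text, so it can be read without the fasttext package. The sketch below assumes the usual layout: a header line with the word count and dimension, followed by one line per word with its values separated by spaces. The function name read_vec and the sample values are made up for illustration.

```python
import io


def read_vec(handle):
    """Parse .vec-style text: a 'count dim' header, then 'word v1 v2 ...' rows."""
    count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return count, dim, vectors


# Made-up three-dimensional values, for illustration only.
sample = io.StringIO("2 3\nmachine 0.1 -0.2 0.3\nlearning 0.0 0.5 -0.1\n")
count, dim, vectors = read_vec(sample)
print(count, dim, vectors["machine"])
```

To read a real file, pass an open file object instead of the in-memory sample, for example `open('model.vec', encoding='utf-8')`.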
For the current API, set model='skipgram' in train_unsupervised().
import fasttext
model = fasttext.train_unsupervised(
    input='TrainingData.txt',
    model='skipgram',
    dim=100,
    epoch=5,
    lr=0.05
)
model.save_model('skipgramModel.bin')
print(model.get_nearest_neighbors('machine'))
The output of get_nearest_neighbors() is a list of (similarity score, word) tuples, ordered from most to least similar. With a tiny training file, these results may not be meaningful. Use a larger corpus before judging model quality.
6. Use a trained FastText word representation model
To use a trained model (the output of the CBOW or SkipGram training above) on another computer or at a later time, load the saved .bin file as shown in the following example.
import fasttext
model = fasttext.load_model('cbowModel.bin')
print model['machine'] # get the vector of the word 'machine'
In current Python syntax, use print() and the model helper methods as shown below.
import fasttext
model = fasttext.load_model('cbowModel.bin')
vector = model.get_word_vector('machine')
print(vector)
print(len(vector))
The vector length matches the dimension used during training. If you trained with dim=100, each word vector has 100 numeric values.
7. Print all words in the FastText model dictionary
The following example Python program gets the list of all words in the model dictionary.
import fasttext
model = fasttext.load_model('cbowModel.bin')
print model.words # list of words in dictionary
With current FastText Python versions, use get_words() to read the dictionary learned during training.
import fasttext
model = fasttext.load_model('cbowModel.bin')
words = model.get_words()
print('Number of words:', len(words))
print(words[:20])
FastText Word Vector Parameters That Matter in Python
FastText exposes several training parameters. You do not need to tune every parameter for a first model, but the following options are useful to understand before training on a larger corpus.
| Parameter | Meaning | Practical note |
|---|---|---|
| model | Selects cbow or skipgram. | Train both and compare results if embedding quality is important. |
| dim | Number of values in each word vector. | Common experiments use dimensions such as 100, 200, or 300. |
| epoch | Number of passes over the training data. | More epochs may help on small data but can overfit noisy patterns. |
| lr | Learning rate. | Controls update size during training. |
| minCount | Minimum word frequency to include in the dictionary. | Lower it for small corpora; raise it to ignore very rare tokens in large corpora. |
| minn and maxn | Minimum and maximum character n-gram lengths. | These control the subword information used by FastText. |
For example, the following code trains a SkipGram model with 200-dimensional vectors for 10 epochs, keeps only words that appear at least twice, and uses character n-grams of length 3 to 6.
import fasttext
model = fasttext.train_unsupervised(
    input='TrainingData.txt',
    model='skipgram',
    dim=200,
    epoch=10,
    minCount=2,
    minn=3,
    maxn=6
)
model.save_model('fasttext_skipgram_200d.bin')
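Before training, it can help to preview which words a given minCount would keep. The sketch below approximates this by counting whitespace-separated tokens in the practice sentences; preview_vocabulary is a hypothetical helper, and the real FastText dictionary may differ slightly (for example, it also tracks an end-of-sentence token).

```python
from collections import Counter


def preview_vocabulary(lines, min_count):
    """Return words that occur at least min_count times, by whitespace tokens."""
    counts = Counter(tok for line in lines for tok in line.split())
    return {word: n for word, n in counts.items() if n >= min_count}


corpus = [
    "machine learning models learn from data",
    "fasttext learns word vectors using subword information",
    "word embeddings represent words as numeric vectors",
    "text classification can use word vectors as features",
]
kept = preview_vocabulary(corpus, min_count=2)
print(sorted(kept.items()))
```

On the tiny practice corpus only three words survive minCount=2, which shows why a lower minCount is usually needed when experimenting with small files.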
Check FastText Word Representations After Training
After training, check whether the model learned useful relationships for your domain. A simple check is to inspect nearest neighbours for important words and verify whether the results are sensible.
import fasttext
model = fasttext.load_model('cbowModel.bin')
for score, word in model.get_nearest_neighbors('machine', k=5):
    print(score, word)
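To build intuition for what a nearest-neighbour score means, the sketch below ranks words by cosine similarity over a small table of hypothetical vectors. It assumes the scores behave like cosine similarities; the vectors and the helper names cosine and nearest are invented for illustration and do not come from a trained model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query, vectors, k=2):
    """Rank every other word by cosine similarity to the query word."""
    scored = [(cosine(vectors[query], vec), word)
              for word, vec in vectors.items() if word != query]
    return sorted(scored, reverse=True)[:k]


# Hypothetical 3-dimensional vectors, invented for illustration.
toy_vectors = {
    "machine": [1.0, 0.0, 0.0],
    "machines": [0.9, 0.1, 0.0],
    "data": [0.0, 1.0, 0.0],
    "banana": [0.0, 0.0, 1.0],
}
neighbours = nearest("machine", toy_vectors, k=2)
for score, word in neighbours:
    print(round(score, 3), word)
```

A score close to 1 means the two vectors point in nearly the same direction, while a score near 0 means the model found no relationship between the words.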
You can also test out-of-vocabulary behaviour. Because FastText uses character n-grams, it can return a vector for a word that was not directly seen in the training dictionary, although the vector is only useful when the subword patterns are meaningful.
import fasttext
model = fasttext.load_model('cbowModel.bin')
known_word_vector = model.get_word_vector('machine')
new_word_vector = model.get_word_vector('machinelike')
print(len(known_word_vector))
print(len(new_word_vector))
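The reason an unseen word such as machinelike can still get a vector is that it shares character n-grams with words in the dictionary. The sketch below extracts n-grams from a word wrapped in `<` and `>` boundary markers, using lengths 3 to 6. It is a simplified illustration: char_ngrams is a hypothetical helper, and it ignores how FastText hashes n-grams into buckets.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return character n-grams of the word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.add(wrapped[start:start + n])
    return grams


known = char_ngrams("machine")
unseen = char_ngrams("machinelike")
shared = known & unseen
print(len(known), len(unseen), len(shared))
print(sorted(shared)[:5])
```

Most of the n-grams of machine also appear in machinelike, so the unseen word inherits much of what the model learned about the known one. A word with no familiar n-grams would get a far less meaningful vector.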
Using FastText Word Vectors in NLP Projects
Once trained, FastText word representations can be used as features for text classification, clustering, semantic search, document similarity, and other NLP workflows. A common simple document representation is the average of the vectors of the words in that document.
import numpy as np
import fasttext
model = fasttext.load_model('cbowModel.bin')
def sentence_vector(sentence):
    words = sentence.lower().split()
    if not words:
        return np.zeros(model.get_dimension())
    vectors = [model.get_word_vector(word) for word in words]
    return np.mean(vectors, axis=0)
vector = sentence_vector('machine learning model')
print(vector.shape)
This averaging approach is simple and useful for learning. For serious applications, evaluate it against your task requirements and compare it with models designed specifically for sentence or document embeddings.
Common FastText Python Word Representation Issues
| Issue | Likely reason | Fix |
|---|---|---|
| AttributeError for fasttext.cbow or fasttext.skipgram | The code uses the older Python API. | Use fasttext.train_unsupervised() with model='cbow' or model='skipgram'. |
| Nearest neighbours look random | The training corpus is too small, too noisy, or unrelated to the query word. | Train on a larger and cleaner domain-specific corpus. |
| The vocabulary is too small | minCount may be too high for the corpus size. | Lower minCount when experimenting with small datasets. |
| Training takes too long | The corpus is large or the selected dimensions and epochs are high. | Start with fewer epochs or a smaller dimension, then scale after checking results. |
| Words with attached punctuation become separate tokens | FastText splits tokens on whitespace, and the input text was not normalized before training. | Clean or normalize text consistently before creating the training file. |
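For the punctuation issue in the last row, a light normalization pass over each line is often enough. The sketch below is one possible approach: it lowercases the text, replaces punctuation with spaces, and collapses repeated whitespace. The helper name normalize_line is hypothetical, and whether to split hyphenated words or keep apostrophes is a choice that depends on your corpus.

```python
import re

_PUNCT = re.compile(r"[^\w\s]", flags=re.UNICODE)
_SPACES = re.compile(r"\s+")


def normalize_line(line):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    line = _PUNCT.sub(" ", line.lower())
    return _SPACES.sub(" ", line).strip()


print(normalize_line("FastText learns word-vectors, quickly!"))
```

Apply the same function to the training file and to any text you query later, so that the model sees tokens in the same form at training time and at lookup time.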
FastText Python Word Representations FAQs
What is a word representation in FastText?
A word representation in FastText is a numeric vector that captures information learned from the word and its surrounding context in the training corpus. FastText can also use character n-grams, which helps it represent related word forms and some words not directly present in the dictionary.
Should I train FastText word vectors with CBOW or SkipGram?
Use CBOW as a fast starting point and try SkipGram when rare-word behaviour or richer word relationships are important. The better choice depends on the dataset and the downstream task, so compare both with a small evaluation set when possible.
Why does old FastText Python code use fasttext.cbow() and fasttext.skipgram()?
Older FastText Python examples used helper functions such as fasttext.cbow() and fasttext.skipgram(). Current examples generally use fasttext.train_unsupervised() and pass model='cbow' or model='skipgram'.
How much data is needed to train useful FastText word vectors?
A tiny text file is enough to test the code, but useful word vectors need a large and relevant corpus. For domain-specific NLP, the training text should contain enough examples of the words and contexts that your application will handle.
Can FastText give a vector for a word not seen during training?
Yes. FastText can use subword information to produce a vector for an unseen word. The result is more reliable when the unseen word shares meaningful character patterns with words seen during training.
Summary: Learning Word Representations with FastText Python
In this FastText tutorial, we learned how to train word representations with unsupervised learning using fasttext in Python. The core steps are to prepare a text corpus, choose CBOW or SkipGram, train the model, save it as a .bin file, load it later, and use the model to retrieve word vectors or nearest neighbours. For current FastText Python code, prefer train_unsupervised(), get_word_vector(), get_words(), and get_nearest_neighbors().
TutorialKart.com