FastText Python - Learn Word Representations

In this tutorial, we learn how to train word representations (word vectors) with FastText in Python using unsupervised learning. We cover the original CBOW and SkipGram examples, the current Python API pattern, input text preparation, model loading, word vector lookup, nearest-neighbour queries, and practical checks before using the vectors in an NLP project.

FastText Python Word Representations: What You Will Build

FastText represents words as dense numeric vectors. Unlike basic word-vector methods that treat every word as a single token, FastText can use subword information, so it can produce useful representations for rare words, misspellings, and related word forms when the training data supports them.
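
To make the subword idea concrete, here is a minimal sketch of character n-gram extraction. It assumes the `<` and `>` word-boundary markers and the 3-to-6 n-gram range that FastText is commonly described as using. The function name `char_ngrams` is made up for illustration and is not part of the fasttext package, and the real library also handles details such as multi-byte characters that this sketch ignores.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of a word wrapped in < and > markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(minn, maxn + 1):
        # Slide a window of length n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

print(char_ngrams("where", minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Two words that share many of these pieces, such as related word forms, end up sharing part of their representation.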

Machine learning models work on numbers, so words and sentences need a numeric form; word vectors are a compact way to provide it. FastText provides tools to learn these word representations, which can improve accuracy in tasks such as text classification.

Install the FastText Python package.
Prepare a plain text training corpus.
Train unsupervised word vectors with CBOW or SkipGram.
Save, load, and query the trained FastText model.
Use the learned vectors in downstream NLP tasks.

Learn Word Representations in FastText

The basic workflow is simple: keep your training text in a plain text file, train a FastText model using an unsupervised method, then use the trained model to retrieve word vectors or related words. The quality of the vectors depends heavily on the amount, cleanliness, and relevance of the text corpus.

1. Install FastText in Python for word-vector training

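The older fasttext package shown in the install log below listed Cython as a prerequisite. To install Cython on such an older setup, run the following command in a terminal. Recent pip releases have removed the --install-option flag, so skip this step if pip rejects the command:

$ pip install Cython --install-option="--no-cython-compile"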

To use fasttext in a Python program, install it using the following command:

$ pip install fasttext

root@arjun-VPCEH26EN:~# pip install fasttext
Collecting fasttext
  Using cached fasttext-0.8.3.tar.gz
Collecting numpy>=1 (from fasttext)
  Downloading numpy-1.13.1-cp27-cp27mu-manylinux1_x86_64.whl (16.6MB)
    100% |████████████████████████████████| 16.6MB 48kB/s
Collecting future (from fasttext)
  Downloading future-0.16.0.tar.gz (824kB)
    100% |████████████████████████████████| 829kB 228kB/s
Building wheels for collected packages: fasttext, future
  Running setup.py bdist_wheel for fasttext ... done
  Stored in directory: /root/.cache/pip/wheels/55/0a/95/e23f773666d3487ee7456b220f7e8d37e99b74833b20dd06a0
  Running setup.py bdist_wheel for future ... done
  Stored in directory: /root/.cache/pip/wheels/c2/50/7c/0d83b4baac4f63ff7a765bd16390d2ab43c93587fac9d6017a
Successfully built fasttext future
Installing collected packages: numpy, future, fasttext
Successfully installed fasttext-0.8.3 future-0.16.0 numpy-1.13.1
root@arjun-VPCEH26EN:~#

FastText is successfully installed in Python.

For current Python environments, it is usually better to install FastText inside a virtual environment and use python -m pip so that the package is installed for the same Python interpreter that runs your program.

python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install fasttext

On Windows PowerShell, activate the virtual environment with the following command before installing FastText.

.venv\Scripts\Activate.ps1
python -m pip install fasttext
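
After installing, it can help to confirm that the package is visible to the same interpreter that will run your script. The check below is a small sketch using only the standard library; the helper name `is_installed` is made up for this example.

```python
import importlib.util
import sys

def is_installed(package_name):
    """Return True if this interpreter can locate the named package."""
    return importlib.util.find_spec(package_name) is not None

# Show which interpreter is running and whether it can see fasttext.
print("Interpreter:", sys.executable)
print("fasttext available:", is_installed("fasttext"))
```

If the second line prints False while pip reports the package as installed, the package was most likely installed for a different Python interpreter or outside the active virtual environment.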

2. Prepare input text data for FastText unsupervised training

Keep in mind that a useful model needs a large corpus that is relevant to your use case, ideally on the order of a billion words or more. The input is given as a plain text file.

For a small practice run, you can start with a few lines of text. For a production-quality word representation model, use a large domain-specific corpus. Each line may contain a sentence or a paragraph. Keep the file in UTF-8 text format.

machine learning models learn from data
fasttext learns word vectors using subword information
word embeddings represent words as numeric vectors
text classification can use word vectors as features

Save the input as TrainingData.txt or any filename of your choice. The examples below assume that the training file is available in the same directory as the Python script.
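
If the sentences come from a Python program rather than an existing file, a sketch like the following can write them out in the expected layout: one sentence per line, UTF-8 encoded, with empty lines dropped and repeated whitespace collapsed. The helper name `write_corpus` is hypothetical, and the example writes to the system temporary directory only so that it can run anywhere; point it at your own project folder in practice.

```python
import tempfile
from pathlib import Path

def write_corpus(sentences, path):
    """Write one sentence per line as UTF-8 and return the line count."""
    # Collapse runs of whitespace, then drop lines that end up empty.
    cleaned = [" ".join(sentence.split()) for sentence in sentences]
    cleaned = [sentence for sentence in cleaned if sentence]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)

sentences = [
    "machine learning models learn from data",
    "fasttext learns word vectors using subword information",
    "  word embeddings   represent words as numeric vectors ",
    "",
]
target = Path(tempfile.gettempdir()) / "TrainingData.txt"
print(write_corpus(sentences, target), "lines written")
# 3 lines written
```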

3. Choose CBOW or SkipGram for FastText word representations

To train word vectors, FastText provides two techniques:

Continuous Bag Of Words (CBOW)
SkipGram

CBOW learns by predicting a word from its surrounding context. SkipGram learns by predicting surrounding context words from a target word. In practice, CBOW is often a good starting point for faster training, while SkipGram is commonly tried when rare-word behaviour is important. Test both on your own dataset if the final application is sensitive to embedding quality.

FastText model type	How it learns	When to try it
CBOW	Predicts the current word from nearby context words.	Good first choice when training speed matters.
SkipGram	Predicts nearby context words from the current word.	Useful to compare when rare words and richer word relationships matter.
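
The difference between the two methods is easiest to see in the training examples each one builds from a sentence. The sketch below is a simplification that assumes a fixed context window; the function names are invented for illustration, and FastText itself builds and samples its examples internally during training.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, target_word) pairs, as CBOW frames the task."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, target

def skipgram_examples(tokens, window=2):
    """Yield (target_word, context_word) pairs, as SkipGram frames the task."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        for word in context:
            yield target, word

tokens = "fasttext learns word vectors".split()
print(list(cbow_examples(tokens, window=1)))
print(list(skipgram_examples(tokens, window=1)))
```

CBOW produces one example per word with the whole context as input, while SkipGram produces one example per (word, neighbour) pair. That gives SkipGram more updates per word, which is one intuition for why it is often tried when rare words matter.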

4. Train a CBOW model with FastText Python

The following example builds a CBOW model. It uses Python 2 print syntax.

import fasttext

# CBOW model
model = fasttext.cbow('TrainingData.txt', 'model')
print model.words # list of words in dictionary

print model['machine'] # get the vector of the word 'machine'

Running the above Python program creates two files: a model file (with .bin extension) containing the trained parameters, and a vector file (with .vec extension) containing the vector representations of the words in the training data file.

The example above uses the older FastText Python API. In current versions of the official Python module, use train_unsupervised() with model='cbow'.

import fasttext

model = fasttext.train_unsupervised(
    input='TrainingData.txt',
    model='cbow',
    dim=100,
    epoch=5,
    lr=0.05
)

model.save_model('cbowModel.bin')
print(model.get_words()[:10])
print(model.get_word_vector('machine'))

The .bin file stores the trained FastText model. You can load this file later to generate vectors, query nearest neighbours, or continue using the model in another Python program.
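
The older API wrote a .vec text file alongside the .bin model, whereas save_model() writes only the binary file. If a text file of vectors is still needed, a sketch like the following can produce one. It assumes the common word2vec-style text layout: a header line with the vocabulary size and dimension, then one word per line followed by its values. The helper name `write_vec_file` and the toy vectors are hypothetical; with a trained model you would pass model.get_words() and model.get_word_vector instead.

```python
import tempfile
from pathlib import Path

def write_vec_file(path, words, vector_for):
    """Write words and their vectors in a word2vec-style text layout."""
    words = list(words)
    dim = len(vector_for(words[0])) if words else 0
    with open(path, "w", encoding="utf-8") as handle:
        # Header line: vocabulary size and vector dimension.
        handle.write(f"{len(words)} {dim}\n")
        for word in words:
            values = " ".join(f"{value:.4f}" for value in vector_for(word))
            handle.write(f"{word} {values}\n")

# Toy stand-in for a trained model, used only to keep the sketch runnable.
toy_vectors = {"machine": [0.1, 0.2], "learning": [0.3, 0.4]}
out_path = Path(tempfile.gettempdir()) / "toy_model.vec"
write_vec_file(out_path, toy_vectors, toy_vectors.__getitem__)
print(out_path.read_text(encoding="utf-8"))
```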

5. Train a SkipGram model with FastText Python

The following example builds a SkipGram model. It uses Python 2 print syntax.

import fasttext

# Skipgram model
model = fasttext.skipgram('TrainingData.txt', 'model')
print model.words # list of words in dictionary

print model['machine'] # get the vector of the word 'machine'

For the current API, set model='skipgram' in train_unsupervised().

import fasttext

model = fasttext.train_unsupervised(
    input='TrainingData.txt',
    model='skipgram',
    dim=100,
    epoch=5,
    lr=0.05
)

model.save_model('skipgramModel.bin')
print(model.get_nearest_neighbors('machine'))

The output of get_nearest_neighbors() is a list of related words with similarity scores. With a tiny training file, these results may not be meaningful. Use a larger corpus before judging model quality.
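
To build intuition for what such a similarity score means, the sketch below ranks words by cosine similarity over a hand-made table of toy vectors. It assumes cosine similarity as the measure, which is the usual choice for word vectors. The function names and the toy values are invented for illustration and do not come from a trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, vectors, k=2):
    """Rank every other word by cosine similarity to the query word."""
    scored = [
        (cosine_similarity(vectors[query], vec), word)
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scored, reverse=True)[:k]

toy = {
    "machine": [1.0, 0.0],
    "machines": [0.9, 0.1],
    "banana": [0.0, 1.0],
}
print(nearest_neighbors("machine", toy))
```

Here "machines" ranks first because its vector points in almost the same direction as "machine", while "banana" is orthogonal and scores zero.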

6. Use a trained FastText word representation model

The following example shows how to use a trained model (the output of the CBOW or SkipGram training above) later or on another computer.

import fasttext
model = fasttext.load_model('cbowModel.bin')
print model['machine'] # get the vector of the word 'machine'

In current Python syntax, use print() and the model helper methods as shown below.

import fasttext

model = fasttext.load_model('cbowModel.bin')
vector = model.get_word_vector('machine')

print(vector)
print(len(vector))

The vector length matches the dimension used during training. If you trained with dim=100, each word vector has 100 numeric values.

7. Print all words in the FastText model dictionary

The following example Python program gets the list of all words in the model dictionary.

import fasttext
model = fasttext.load_model('cbowModel.bin')
print model.words # list of words in dictionary

With current FastText Python versions, use get_words() to read the dictionary learned during training.

import fasttext

model = fasttext.load_model('cbowModel.bin')
words = model.get_words()

print('Number of words:', len(words))
print(words[:20])

FastText Word Vector Parameters That Matter in Python

FastText exposes several training parameters. You do not need to tune every parameter for a first model, but the following options are useful to understand before training on a larger corpus.

Parameter	Meaning	Practical note
`model`	Selects `cbow` or `skipgram`.	Train both and compare results if embedding quality is important.
`dim`	Number of values in each word vector.	Common experiments use dimensions such as 100, 200, or 300.
`epoch`	Number of passes over the training data.	More epochs may help on small data but can overfit noisy patterns.
`lr`	Learning rate.	Controls update size during training.
`minCount`	Minimum word frequency to include in the dictionary.	Lower it for small corpora; raise it to ignore very rare tokens in large corpora.
`minn` and `maxn`	Minimum and maximum character n-gram lengths.	These control the subword information used by FastText.

For example, the following code trains a SkipGram model with 200-dimensional vectors for 10 epochs, keeps only words that appear at least twice, and uses character n-grams of length 3 to 6.

import fasttext

model = fasttext.train_unsupervised(
    input='TrainingData.txt',
    model='skipgram',
    dim=200,
    epoch=10,
    minCount=2,
    minn=3,
    maxn=6
)

model.save_model('fasttext_skipgram_200d.bin')
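
Before training, it can be useful to estimate how a minimum-frequency threshold will shape the dictionary. The sketch below counts whitespace-separated tokens and keeps those that meet the threshold. It is a rough stand-in for what minCount does, not the library's own code, and the helper name `build_vocabulary` is made up for this example.

```python
from collections import Counter

def build_vocabulary(lines, min_count=5):
    """Count whitespace-separated tokens and keep the frequent ones."""
    counts = Counter(token for line in lines for token in line.split())
    return {word: n for word, n in counts.items() if n >= min_count}

corpus = [
    "machine learning models learn from data",
    "machine learning uses data",
    "fasttext learns word vectors",
]
print(len(build_vocabulary(corpus, min_count=1)), "words with min_count=1")
print(build_vocabulary(corpus, min_count=2))
# {'machine': 2, 'learning': 2, 'data': 2}
```

On a tiny corpus, even a threshold of 2 removes most of the vocabulary, which is why a low value is advisable for small practice files.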

Check FastText Word Representations After Training

After training, check whether the model learned useful relationships for your domain. A simple check is to inspect nearest neighbours for important words and verify whether the results are sensible.

import fasttext

model = fasttext.load_model('cbowModel.bin')

for score, word in model.get_nearest_neighbors('machine', k=5):
    print(score, word)
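
Eyeballing neighbours works for a handful of words. For a slightly more systematic check, you could list a few word pairs you expect to be related and measure how often the expected word shows up in the top results. The sketch below assumes a callable that returns (score, word) pairs in the same shape as get_nearest_neighbors(). The names `neighbor_hit_rate` and `toy_neighbors` and the toy data are invented for illustration.

```python
def neighbor_hit_rate(neighbors_fn, expectations, k=5):
    """Fraction of query words whose expected neighbour is in the top k."""
    hits = 0
    for query, expected in expectations.items():
        found = {word for _score, word in neighbors_fn(query, k)}
        if expected in found:
            hits += 1
    return hits / len(expectations) if expectations else 0.0

# Toy stand-in for a trained model. With fasttext you could pass
# lambda word, k: model.get_nearest_neighbors(word, k=k) instead.
def toy_neighbors(word, k):
    table = {
        "machine": [(0.91, "machines"), (0.72, "learning")],
        "vector": [(0.64, "banana")],
    }
    return table.get(word, [])[:k]

expectations = {"machine": "machines", "vector": "vectors"}
print(neighbor_hit_rate(toy_neighbors, expectations, k=5))
# 0.5
```

A score like this is only as good as the expectation list, so treat it as a quick regression check between training runs rather than a full evaluation.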

You can also test out-of-vocabulary behaviour. Because FastText uses character n-grams, it can return a vector for a word that was not directly seen in the training dictionary, although the vector is only useful when the subword patterns are meaningful.

import fasttext

model = fasttext.load_model('cbowModel.bin')

known_word_vector = model.get_word_vector('machine')
new_word_vector = model.get_word_vector('machinelike')

print(len(known_word_vector))
print(len(new_word_vector))
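
The following sketch illustrates, in simplified form, how a vector for an unseen word can be assembled from its character n-grams. It assumes a plain lookup table of n-gram vectors and a simple average over the n-grams found in that table. The real library stores n-grams differently (it hashes them into a fixed number of buckets), so treat this as a conceptual picture only. All names and values here are hypothetical.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of the word wrapped in < and > markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]

def subword_vector(word, ngram_vectors, dim):
    """Average the vectors of the word's n-grams that the table knows."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

# Toy n-gram table; a real model learns these values during training.
ngram_vectors = {"<ma": [1.0, 0.0], "mac": [0.0, 1.0], "ike": [1.0, 1.0]}
print(subword_vector("machinelike", ngram_vectors, dim=2))
print(subword_vector("xyz", ngram_vectors, dim=2))
```

The second call returns all zeros because none of its n-grams are in the table, which mirrors the caveat above: the vector is only useful when the subword patterns were meaningful during training.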

Using FastText Word Vectors in NLP Projects

Once trained, FastText word representations can be used as features for text classification, clustering, semantic search, document similarity, and other NLP workflows. A common simple document representation is the average of the vectors of the words in that document.

import numpy as np
import fasttext

model = fasttext.load_model('cbowModel.bin')

def sentence_vector(sentence):
    words = sentence.lower().split()
    if not words:
        return np.zeros(model.get_dimension())
    vectors = [model.get_word_vector(word) for word in words]
    return np.mean(vectors, axis=0)

vector = sentence_vector('machine learning model')
print(vector.shape)

This averaging approach is simple and useful for learning. For serious applications, evaluate it against your task requirements and compare it with models designed specifically for sentence or document embeddings.
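
One variation worth comparing is to scale each word vector to unit length before averaging, so that words with unusually large vectors do not dominate the result. The sketch below shows the idea on plain Python lists. The helper name `normalized_average` is made up for this example, and whether this variant helps on your task is something to verify on your own data.

```python
import math

def normalized_average(vectors):
    """Average vectors after scaling each one to unit length."""
    if not vectors:
        return []
    total = [0.0] * len(vectors[0])
    used = 0
    for vec in vectors:
        norm = math.sqrt(sum(v * v for v in vec))
        if norm == 0:
            continue  # skip all-zero vectors, they carry no direction
        total = [t + v / norm for t, v in zip(total, vec)]
        used += 1
    return [t / used for t in total] if used else total

print(normalized_average([[3.0, 0.0], [0.0, 0.5]]))
# [0.5, 0.5]
```

With a plain mean the first vector would dominate (the result would be [1.5, 0.25]); after normalization both words contribute equally.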

Common FastText Python Word Representation Issues

Issue	Likely reason	Fix
`AttributeError` for `fasttext.cbow` or `fasttext.skipgram`	The code uses the older Python API.	Use `fasttext.train_unsupervised()` with `model='cbow'` or `model='skipgram'`.
Nearest neighbours look random	The training corpus is too small, too noisy, or unrelated to the query word.	Train on a larger and cleaner domain-specific corpus.
The vocabulary is too small	`minCount` may be too high for the corpus size.	Lower `minCount` when experimenting with small datasets.
Training takes too long	The corpus is large or the selected dimensions and epochs are high.	Start with fewer epochs or a smaller dimension, then scale after checking results.
Words with attached punctuation (for example `data.` and `data`) are treated as different tokens	The input text was not normalized before training, and FastText splits tokens on whitespace.	Clean or normalize text consistently before creating the training file.
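
For the punctuation issue in the last row, a small normalization step before writing the training file is often enough. The sketch below lowercases each line and replaces punctuation with spaces using a regular expression. The helper name `normalize_line` and the exact rules are assumptions for illustration; some domains need punctuation such as hyphens or apostrophes to be kept.

```python
import re

def normalize_line(line):
    """Lowercase a line and replace punctuation with spaces."""
    lowered = line.lower()
    # Replace anything that is neither a word character nor whitespace.
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    # Collapse the extra spaces left behind by the substitution.
    return " ".join(no_punct.split())

print(normalize_line("FastText learns word-vectors, quickly!"))
# fasttext learns word vectors quickly
```

Apply the same normalization to any text you later pass to the model, so that query words match the forms seen during training.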

FastText Python Word Representations FAQs

What is a word representation in FastText?

A word representation in FastText is a numeric vector that captures information learned from the word and its surrounding context in the training corpus. FastText can also use character n-grams, which helps it represent related word forms and some words not directly present in the dictionary.

Should I train FastText word vectors with CBOW or SkipGram?

Use CBOW as a fast starting point and try SkipGram when rare-word behaviour or richer word relationships are important. The better choice depends on the dataset and the downstream task, so compare both with a small evaluation set when possible.

Why does old FastText Python code use fasttext.cbow() and fasttext.skipgram()?

Older FastText Python examples used helper functions such as fasttext.cbow() and fasttext.skipgram(). Current examples generally use fasttext.train_unsupervised() and pass model='cbow' or model='skipgram'.

How much data is needed to train useful FastText word vectors?

A tiny text file is enough to test the code, but useful word vectors need a large and relevant corpus. For domain-specific NLP, the training text should contain enough examples of the words and contexts that your application will handle.

Can FastText give a vector for a word not seen during training?

Yes. FastText can use subword information to produce a vector for an unseen word. The result is more reliable when the unseen word shares meaningful character patterns with words seen during training.

Summary: Learning Word Representations with FastText Python

In this FastText tutorial, we learned how to train word representations with unsupervised learning using fasttext in Python. The core steps are to prepare a text corpus, choose CBOW or SkipGram, train the model, save it as a .bin file, load it later, and use the model to retrieve word vectors or nearest neighbours. For current FastText Python code, prefer train_unsupervised(), get_word_vector(), get_words(), and get_nearest_neighbors().

TutorialKart.com
