Vocabulary

The set of unique tokens known to a language model or NLP system.

Description

In natural language processing, a vocabulary is the set of unique tokens that a model or system recognizes. Depending on the tokenization method, these tokens are words, subwords, or characters. The vocabulary is typically built from the training data and fixed before training, so it determines which inputs the model can represent directly. Its size and composition affect model performance and memory usage, since each token usually has its own embedding. They also determine how out-of-vocabulary words are handled: a word-level vocabulary typically maps unseen words to a single unknown token, while subword and character vocabularies can break them into smaller known pieces.
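
As a rough illustration of how a vocabulary is built from training data and how out-of-vocabulary words show up, here is a minimal word-level sketch. It assumes whitespace tokenization and lowercasing; the function names `build_vocab` and `encode` and the special tokens `<pad>` and `<unk>` are illustrative choices, not part of any particular library.

```python
from collections import Counter

def build_vocab(corpus, max_size=None, specials=("<pad>", "<unk>")):
    """Map each token to an integer id, most frequent tokens first."""
    counts = Counter(tok for line in corpus for tok in line.lower().split())
    # Sort by descending frequency, then alphabetically so ids are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if max_size is not None:
        # Reserve room for the special tokens inside the size limit.
        ordered = ordered[: max(0, max_size - len(specials))]
    tokens = list(specials) + [tok for tok, _ in ordered]
    return {tok: idx for idx, tok in enumerate(tokens)}

def encode(text, vocab, unk="<unk>"):
    """Convert text to ids; tokens missing from the vocabulary map to the unk id."""
    unk_id = vocab[unk]
    return [vocab.get(tok, unk_id) for tok in text.lower().split()]

corpus = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocab(corpus)
ids = encode("the bird sat", vocab)  # "bird" never appeared in the corpus
print(vocab)
print(ids)
```

Capping `max_size` trades coverage for memory: rarer words fall out of the vocabulary and are encoded as `<unk>`, which is the out-of-vocabulary problem described above.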

Examples

  • 📚 Word-level vocabulary
  • 🧩 Subword vocabulary (e.g., WordPiece in BERT, byte-pair encoding in GPT models)
  • 🔤 Character-level vocabulary
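
The three granularities listed above can be compared on a single string. The sketch below is only illustrative: the subword splitter is a simple greedy longest-match over a hand-written set of pieces (the `pieces` set is an assumption made for this example), not the learned WordPiece or byte-pair encoding vocabularies that real models use.

```python
def word_tokens(text):
    """Word-level: split on whitespace."""
    return text.split()

def char_tokens(text):
    """Character-level: every character, including spaces, is a token."""
    return list(text)

def subword_tokens(text, pieces):
    """Subword-level: greedy longest-match against a set of known pieces."""
    out = []
    for word in text.split():
        start = 0
        while start < len(word):
            end = len(word)
            # Shrink the candidate until it is a known piece or one character.
            while end > start + 1 and word[start:end] not in pieces:
                end -= 1
            # A single character is emitted as-is, even if it is not in pieces.
            out.append(word[start:end])
            start = end
    return out

text = "unbelievable tokenization"
pieces = {"un", "believ", "able", "token", "ization"}  # hypothetical subword vocabulary

print(word_tokens(text))
print(subword_tokens(text, pieces))
print(char_tokens(text))
```

Word-level tokenization gives the shortest sequences but needs one vocabulary entry per distinct word, character-level needs very few entries but produces long sequences, and subword vocabularies sit between the two.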

Applications

  • 📝 Language modeling
  • 🌐 Machine translation
  • ✍️ Text generation
