3 Text embeddings


This chapter covers:

  • Preparing texts for deep learning with word and document embeddings.
  • The benefits and drawbacks of training your own text embeddings versus using pre-trained embeddings.
  • Word similarity with word2vec.
  • Document retrieval via document embeddings, using doc2vec.

This diagram shows the organization of the chapter:

[Figure: mental model of chapter 3 (all blocks)]

After reading this chapter, you will have a practical command of text embedding algorithms and an understanding of how to apply embeddings in NLP. We will work through a number of concrete scenarios to reach that goal.

But first, let’s review the basics of embeddings.

3.1  Embeddings

[Figure: mental model of chapter 3, block 1: embeddings]

Embeddings are systematic, well-crafted procedures for converting input data into vector representations. A vector is a container (like an array) of numbers, and every vector lives in a multidimensional vector space as a single point: an embedding projects ('embeds') input data into such a space.
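To make this concrete, here is a minimal sketch in NumPy. The two words and their coordinates are purely illustrative (not values from any trained model); the point is that each word becomes a single point in a small vector space, and points that lie close together represent similar inputs.

import numpy as np

# Two hypothetical word embeddings: each word is one point in a
# 3-dimensional vector space.
king = np.array([0.8, 0.1, 0.4])
queen = np.array([0.7, 0.2, 0.5])

# Cosine similarity is a common way to measure how close two points are;
# for these nearby points it comes out close to 1.0.
cosine = np.dot(king, queen) / (np.linalg.norm(king) * np.linalg.norm(queen))
print(cosine)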

We saw ample vector representations of texts in Chapters 1 and 2: one-hot vectors (binary-valued vectors with one bit 'on' for a specific word), used for bag-of-words representations; frequency-based vectors; and word2vec representations. All of these vector representations were created by embeddings.
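As a brief reminder of how those one-hot and bag-of-words vectors work, here is a small sketch over a hypothetical five-word vocabulary (the vocabulary and sentence are made up for illustration): every word gets a binary vector with exactly one bit 'on', and a bag-of-words vector for a sentence is the sum of the one-hot vectors of its words.

import numpy as np

# A tiny, hypothetical vocabulary; index positions define the vectors.
vocab = ["the", "cat", "sat", "on", "mat"]

def one_hot(word):
    # Binary vector with a single 1 at the word's vocabulary index.
    vec = np.zeros(len(vocab), dtype=int)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("cat"))  # [0 1 0 0 0]

# Bag-of-words: frequency counts over the vocabulary, obtained by
# summing the one-hot vectors of every word in the sentence.
sentence = ["the", "cat", "sat", "on", "the", "mat"]
bow = sum(one_hot(w) for w in sentence)
print(bow)  # [2 1 1 1 1]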

3.1.1  Embedding by hand: representational embeddings

3.1.2  Learning to embed: procedural embeddings

3.2  From words to vectors: word2vec

3.3  From documents to vectors: doc2vec

3.4  Wrapping up

3.5  External resources
