This course will cover the fundamental tools and models used for analyzing and generating natural language data in a computational setting. We will learn about the core principles behind contemporary natural language processing (NLP) methods.
Topics include the structure of natural language data and how it can be used for downstream tasks such as document search and classification, text generation, and text summarization, using both heuristic-based and neural-network-based language modeling.
This course expects previous experience with Python, though no previous experience with machine learning is required.
The nature of text data
Preprocessing: tokenization and lemmatization
Bag-of-words topic models and naive classification
N-gram language models (Markov models)
Hidden Markov models and part-of-speech tagging
Distributed representations and vector semantics
Recurrent neural language models: LSTMs and language generation
Transformers and masked language modeling
Encoder models and semantic search
Encoder-decoder models: text summarization and translation