Gaussian Processes for Natural Language Processing
Gaussian Processes (GPs) [Rasmussen and Williams, 2006] are a powerful modelling framework incorporating kernels and Bayesian inference, and are recognised as state-of-the-art for many machine learning tasks. Despite this, they have only recently been applied to natural language processing (NLP), even though they have many properties that are appealing for the field. First, GPs are non-parametric, which results in flexible models that can adapt their capacity to the available data. Second, their Bayesian formulation helps to prevent overfitting and propagates uncertainty, which is rampant in natural language. Finally, they allow the use of kernels (a type of similarity metric) specifically tailored for texts, potentially lessening the need for expensive feature engineering. Overall, GPs provide an elegant, flexible and simple means of probabilistic inference and are well overdue for consideration by the NLP community.
This tutorial will focus primarily on regression, but extensions to other settings will be shown as well. Within NLP, linear models are near ubiquitous: they provide good results for many tasks, support efficient inference (including dynamic programming in structured prediction) and allow simple parameter interpretation. However, linear models are inherently limited in the types of relationships between variables they can capture, and non-linear methods are often required for better understanding and improved performance. Currently, kernel methods such as Support Vector Machines (SVMs) are a popular choice for non-linear modelling, but they suffer from a lack of interoperability with down-stream processing as part of a larger model, inflexible parameterisation and the associated high cost of hyperparameter optimisation. GPs appear similar to SVMs in that they incorporate kernels, but their probabilistic formulation allows for much wider applicability in larger graphical models. Moreover, several properties of Gaussian distributions (closure under integration and Gaussian-Gaussian conjugacy) mean that GP regression supports analytic formulations for the posterior and predictive inference.
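To make that last point concrete, here is a minimal sketch of analytic GP regression with a squared exponential kernel. All function and variable names are my own illustrative choices, not material from the tutorial, and a practical implementation would use a Cholesky factorisation rather than an explicit inverse.

```python
import numpy as np

def sq_exp_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared exponential (RBF) kernel between two sets of 1-D inputs."""
    sqdist = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

def gp_posterior(X_train, y_train, X_test, noise=0.1):
    """Closed-form GP regression posterior mean and covariance.

    Gaussian-Gaussian conjugacy gives the posterior analytically:
    no iterative approximate inference is needed.
    """
    K = sq_exp_kernel(X_train, X_train) + noise ** 2 * np.eye(len(X_train))
    K_s = sq_exp_kernel(X_train, X_test)
    K_ss = sq_exp_kernel(X_test, X_test)
    K_inv = np.linalg.inv(K)  # for clarity; use Cholesky solves in practice
    mean = K_s.T @ K_inv @ y_train
    cov = K_ss - K_s.T @ K_inv @ K_s
    return mean, cov
```

For example, fitting noisy observations of a smooth function and querying a test point returns both a prediction and its uncertainty, which is what allows GP outputs to be propagated into larger probabilistic models.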
1. GP Fundamentals (60 mins)
– From Gaussian Distributions to Processes
– The Squared Exponential Kernel
– Model Selection
2. NLP Applications (50 mins)
– Temporal Patterns in Word Frequencies
– Machine Translation Quality Estimation
– Emotion Analysis
3. Further Topics (50 mins)
– Structural Kernels
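Model selection, listed in the outline above, is typically carried out by maximising the log marginal likelihood of the data with respect to the kernel hyperparameters. A minimal sketch of that quantity for a squared exponential kernel, with names of my own choosing:

```python
import numpy as np

def log_marginal_likelihood(X, y, lengthscale, variance, noise):
    """Log evidence of a GP regression model, computed via Cholesky.

    log p(y|X) = -0.5 y^T K^{-1} y - 0.5 log|K| - (n/2) log(2*pi),
    where K is the kernel matrix plus observation noise.
    """
    sqdist = (X[:, None] - X[None, :]) ** 2
    K = variance * np.exp(-0.5 * sqdist / lengthscale ** 2)
    K += noise ** 2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))       # 0.5 log|K|
            - 0.5 * len(X) * np.log(2 * np.pi))
```

Evaluating this for different lengthscale, variance and noise settings (or passing its gradient to an optimiser) selects hyperparameters without a separate validation set, one of the practical attractions of the Bayesian formulation.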
Daniel Beck is a final year PhD student at the University of Sheffield. His research lies at the intersection of Natural Language Processing and Machine Learning, where he specialises in Bayesian methods. He has published a number of papers on applying Gaussian Processes to text regression tasks and is particularly interested in structural kernels and uncertainty propagation. He is also an active member of the NLP community, reviewing for top-rated conferences such as EMNLP and ACL, and has experience teaching at NLP-related summer schools.