Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Disentangling low-dimensional vector space representations of text documents

Ager, Thomas 2021. Disentangling low-dimensional vector space representations of text documents. PhD Thesis, Cardiff University.
Item availability restricted.

PDF (PhD Thesis) - Accepted Post-Print Version
Available under License Creative Commons Attribution No Derivatives.
Download (940kB)

PDF (Cardiff University Electronic Publication Form) - Supplemental Material
Restricted to Repository staff only
Download (1MB)

Abstract

In contrast to traditional document representations such as bags-of-words, the kind of vector space representations that are currently most popular tend to be lower-dimensional. This has important advantages, e.g. making the representation of a given document less dependent on the exact words that are used. However, it also comes at an important cost: the features of the representation are entangled, i.e. no individual feature is meaningful on its own. The main aim of this thesis is to address this problem by disentangling vector spaces into representations composed of meaningful features that are closely aligned with natural categories from the given domain. For instance, in the domain of IMDB movie reviews, where each document is a review, a disentangled feature representation would be separated into features that describe how "Scary", "Romantic", ..., or "Comedic" a movie is.

This thesis builds on an initial approach introduced by Derrac and Schockaert [21], which derives features from low-dimensional vector spaces. The method begins by using a linear classifier to find a hyperplane that separates the documents that contain a given term from those that do not. The direction of the vector orthogonal to each hyperplane then induces a ranking of documents, from those least related to the term (furthest from the hyperplane on the negative side) to those most related to it (furthest from the hyperplane on the positive side). To identify which terms describe semantically important features, each term is scored by how well its linear classifier performs on a standard classification metric, which approximates how linearly separable the documents containing the term are in the vector space. The assumption is that the more separable a term is, the better it is modelled in the space. The highest-scoring terms are selected as features, and documents are ranked by taking the dot product between the vector orthogonal to the hyperplane and each document vector. This results in a ranking of documents according to how strongly each feature is expressed, e.g. movies can be ranked by how "Scary" they are. Only the direction of this orthogonal vector is considered in this work, as our concern is to obtain document rankings (a minimal sketch of this procedure is given below). Derrac and Schockaert [21] obtained semantic features from Multi-Dimensional Scaling (MDS) document embeddings and validated their work by classifying documents with a rule-based classifier (FOIL), resulting in rules of the form "IF x is more scary than most horror films THEN x is a horror film."

Their work was focused on showing the feasibility of learning disentangled representations, but it did not make clear which components of the method were essential. The first main contribution of this thesis therefore consists in a thorough investigation of variants of their method: a quantitative analysis is conducted of different document representations (as opposed to only MDS) and different term scoring functions (as opposed to only the Kappa score), and the proposed clustering method is revisited. This extensive evaluation spans a variety of new domains and compares the method against stronger baselines. To quantitatively analyse the impact of these design choices, the use of low-depth decision trees that classify natural categories in the domain is proposed. A qualitative analysis of the discovered features is also presented.
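As an illustration, the following is a minimal sketch of the feature-direction procedure described above, assuming scikit-learn and NumPy. The choice of logistic regression as the linear classifier, and all variable and function names, are illustrative rather than taken from the thesis.

# Minimal sketch of the feature-direction procedure described above.
# Assumptions: scikit-learn and NumPy; logistic regression stands in
# for "a linear classifier"; names are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score

def score_term(doc_vectors, has_term):
    """Fit a hyperplane separating documents that contain a term from
    those that do not, and score how separable the term is (Kappa)."""
    clf = LogisticRegression(max_iter=1000).fit(doc_vectors, has_term)
    kappa = cohen_kappa_score(has_term, clf.predict(doc_vectors))
    direction = clf.coef_[0]   # vector orthogonal to the hyperplane
    return kappa, direction

def rank_documents(doc_vectors, direction):
    """Rank documents by how strongly they express the feature:
    the dot product with the orthogonal vector induces the ranking."""
    scores = doc_vectors @ direction
    return np.argsort(-scores)  # most related first

In this sketch, the terms whose classifiers achieve the highest Kappa scores would be kept as the feature set, and each kept direction yields one document ranking, exactly as described above.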
Neural network architectures have advanced the state of the art in many tasks. The second main contribution of the thesis follows the idea that the hidden layers of a neural network can be viewed as vector space embeddings, from which meaningful features describing documents can be derived. In particular, to test the potential of neural networks for discovering features that cannot be discovered with standard document embedding methods, feed-forward neural networks and stacked auto-encoders are investigated quantitatively and qualitatively. Auto-encoders are stacked by using the hidden layer of one auto-encoder as the input to the next (a sketch of this stacking scheme is given below). We find that meaningful features can indeed be derived from the hidden layers of the considered neural network architectures, and we quantitatively assess how predictive these features are compared to those of the input embeddings. Qualitatively, we find that feed-forward networks tend to select and refine features that were already modelled in the input embedding, whereas stacked auto-encoders tend to model increasingly abstract features as additional hidden layers are added. For example, in the initial auto-encoder layers, features like "Horror" and "Comedy" can be separated well by the linear classifier, while features like "Society" and "Relationships" are more separable in later layers. After identifying directions that model important features of documents in each stacked auto-encoder, symbolic rules are induced that relate specific features to more general ones. These rules can clarify the nature of the transformations learned by the neural network, for example:

IF Emotions AND Journey THEN Adventure

The third contribution of this thesis is an additional post-processing step for improving disentangled feature representations: the original embedding is fine-tuned such that the rankings of documents induced by the disentangled features agree with the rankings induced by Pointwise Mutual Information (PMI) scores. The motivation for this contribution stems from the fact that methods for learning document embeddings are mostly aimed at modelling similarity, and it is found that there is an inherent trade-off between capturing similarity and faithfully modelling features as directions. Following this observation, a simple method to fine-tune document embeddings is proposed, with the aim of improving the quality of the feature directions obtained from them. The method is unsupervised, requiring only a bag-of-words representation of the documents as input. In particular, clusters of terms are identified that refer to semantically meaningful and important features of the considered domain, and a simple neural network model is used to learn a new representation in which each of these features is more faithfully modelled as a direction. In most cases this method improves the ranking of documents, and it results in increased performance when disentangled feature representations are used as input to classifiers.

Overall, this thesis quantitatively and qualitatively confirms that disentangled representations of meaningful features can be derived from low-dimensional vector spaces of documents, across a variety of domains and document embedding models.
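As an illustration of the layer-wise stacking described above, below is a minimal sketch in which each auto-encoder is trained to reconstruct its input and its hidden layer then becomes the training input for the next auto-encoder. PyTorch, the activation function, the layer sizes, and all names are assumptions made for illustration; the abstract does not specify them.

# Minimal sketch of stacking auto-encoders, as described above: the
# hidden layer of each trained auto-encoder is the input to the next.
# Assumptions: PyTorch, Tanh activation, MSE reconstruction loss.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, dim_in, dim_hidden):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.Tanh())
        self.decoder = nn.Linear(dim_hidden, dim_in)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def stack_autoencoders(x, hidden_dims, epochs=50, lr=1e-3):
    """Train auto-encoders layer by layer; each hidden representation
    becomes the training input for the next auto-encoder."""
    layers, current = [], x
    for dim in hidden_dims:
        ae = AutoEncoder(current.shape[1], dim)
        opt = torch.optim.Adam(ae.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.mse_loss(ae(current), current)
            loss.backward()
            opt.step()
        layers.append(ae)
        # The hidden layer of this auto-encoder feeds the next one.
        current = ae.encoder(current).detach()
    return layers, current

Feature directions could then be extracted from each hidden representation in turn (e.g. with the linear-classifier sketch given earlier), which is how the increasingly abstract features at deeper layers would be identified.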

Item Type: Thesis (PhD)
Date Type: Completion
Status: Unpublished
Schools: Computer Science & Informatics
Subjects: Q Science > Q Science (General)
T Technology > T Technology (General)
Date of First Compliant Deposit: 5 August 2021
Date of Acceptance: 19 July 2021
Last Modified: 03 Aug 2022 01:38
URI: https://orca.cardiff.ac.uk/id/eprint/143148
