Unsupervised learning is useful when a dataset contains only features and no measurement of the outcome, or when we want to extract information independent from the outcome. Instead of predicting future outcomes, the goal is to learn an informative representation of the data that is useful for solving another task, including the exploration of a data set. Examples include the identification of topics to summarize documents (see Chapter 14, the reduction of the number of features to lower the risk of overfitting and computational cost for supervised learning, or to group similar observations as illustrated by the use of clustering for asset allocation at the end of this chapter.
Dimensionality reduction and clustering are the main tasks for unsupervised learning: - Dimensionality reduction transforms the existing features into a new, smaller set while minimizing the loss of information. A broad range of algorithms exists that differ by how they measure the loss of information, whether they apply linear or non-linear transformations or the constraints they impose on the new feature set. - Clustering algorithms identify and group similar observations or features instead of identifying new features. Algorithms differ in how they define the similarity of observations and their assumptions about the resulting groups.
More specifically, this chapter covers: - How principal and independent component analysis (PCA and ICA) perform linear dimensionality reduction - Identifying data-driven risk factors and eigenportfolios from asset returns using PCA - Effectively visualizing nonlinear, high-dimensional data using manifold learning - Using T-SNE and UMAP to explore high-dimensional image data - How k-means, hierarchical, and density-based clustering algorithms work - Using agglomerative clustering to build robust portfolios with hierarchical risk parity
The number of dimensions of a dataset matter because each new dimension can add signal concerning an outcome. However, there is also a downside known as the curse of dimensionality: as the number of independent features grows while the number of observations remains constant, the average distance between data points also grows, and the density of the feature space drops exponentially. The implications for machine learning are dramatic because prediction becomes much harder when observations are more distant, i.e., different from each other.
The notebook curse_of_dimensionality simulates how the average and minimum distances between data points increase as the number of dimensions grows.
Linear dimensionality reduction algorithms compute linear combinations that translate, rotate, and rescale the original features to capture significant variation in the data, subject to constraints on the characteristics of the new features.
This section introduces these two algorithms and then illustrates how to apply PCA to asset returns to learn risk factors from the data, and to build so-called eigen portfolios for systematic trading strategies.
PCA finds principal components as linear combinations of the existing features and uses these components to represent the original data. The number of components is a hyperparameter that determines the target dimensionality and needs to be equal to or smaller than the number of observations or columns, whichever is smaller.
The notebook pca_key_ideas visualizes principal components in 2D and 3D.
PCA aims to capture most of the variance in the data, to make it easy to recover the original features, and that each component adds information. It reduces dimensionality by projecting the original data into the principal component space. PCA makes several assumptions that are important to keep in mind. These include: - high variance implies a high signal-to-noise ratio - the data is standardized so that the variance is comparable across features - linear transformations capture the relevant aspects of the data, and - higher-order statistics beyond the first and second moment do not matter, which implies that the data has a normal distribution
The emphasis on the first and second moments align with standard risk/return metrics, but the normality assumption may conflict with the characteristics of market data.
The notebook the_math_behind_pca illustrate the computation of principal components.
PCA is useful for algorithmic trading in several respects, including the data-driven derivation of risk factors by applying PCA to asset returns, and the construction of uncorrelated portfolios based on the principal components of the correlation matrix of asset returns.
In Chapter 07 - Linear Models, we explored risk factor models used in quantitative finance to capture the main drivers of returns. These models explain differences in returns on assets based on their exposure to systematic risk factors and the rewards associated with these factors.
In particular, we explored the Fama-French approach that specifies factors based on prior knowledge about the empirical behavior of average returns, treats these factors as observable, and then estimates risk model coefficients using linear regression. An alternative approach treats risk factors as latent variables and uses factor analytic techniques like PCA to simultaneously estimate the factors and how the drive returns from historical returns.
Another application of PCA involves the covariance matrix of the normalized returns. The principal components of the correlation matrix capture most of the covariation among assets in descending order and are mutually uncorrelated. Moreover, we can use standardized principal components as portfolio weights.
The notebook pca_and_eigen_portfolios illustrates how to create Eigenportfolios.
Independent component analysis (ICA) is another linear algorithm that identifies a new basis to represent the original data but pursues a different objective than PCA. See Hyvärinen and Oja (2000) for a detailed introduction.
ICA emerged in signal processing, and the problem it aims to solve is called blind source separation. It is typically framed as the cocktail party problem where a given number of guests are speaking at the same time so that a single microphone would record overlapping signals. ICA assumes there are as many different microphones as there are speakers, each placed at different locations so that it records a different mix of the signals. ICA then aims to recover the individual signals from the different recordings.
The manifold hypothesis emphasizes that high-dimensional data often lies on or near a lower-dimensional non-linear manifold that is embedded in the higher dimensional space.
Manifold learning aims to find the manifold of intrinsic dimensionality and then represent the data in this subspace. A simplified example uses a road as one-dimensional manifolds in a three-dimensional space and identifies data points using house numbers as local coordinates.
The notebook manifold_learning_intro contains several exampoles, including the two-dimensional swiss roll that illustrates the topological structure of manifolds.
Several techniques approximate a lower dimensional manifold. One example is locally-linear embedding (LLE) that was developed in 2000 by Sam Roweis and Lawrence Saul.
The generic examples use the following datasets:
t-SNE is an award-winning algorithm developed in 2010 by Laurens van der Maaten and Geoff Hinton to detect patterns in high-dimensional data. It takes a probabilistic, non-linear approach to locating data on several different, but related low-dimensional manifolds. The algorithm emphasizes keeping similar points together in low dimensions, as opposed to maintaining the distance between points that are apart in high dimensions, which results from algorithms like PCA that minimize squared distances.
UMAP) is a more recent algorithm for visualization and general dimensionality reduction. It assumes the data is uniformly distributed on a locally connected manifold and looks for the closest low-dimensional equivalent using fuzzy topology. It uses a neighbors parameter that impacts the result similarly as perplexity above.
It is faster and hence scales better to large datasets than t-SNE, and sometimes preserves global structure than better than t-SNE. It can also work with different distance functions, including, e.g., cosine similarity that is used to measure the distance between word count vectors.
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, Leland McInnes, John Healy, 2018
Both clustering and dimensionality reduction summarize the data. Dimensionality reduction compresses the data by representing it using new, fewer features that capture the most relevant information. Clustering algorithms, in contrast, assign existing observations to subgroups that consist of similar data points.
Clustering can serve to better understand the data through the lens of categories learned from continuous variables. It also permits automatically categorizing new objects according to the learned criteria. Examples of related applications include hierarchical taxonomies, medical diagnostics, or customer segmentation. Alternatively, clusters can be used to represent groups as prototypes, using e.g. the midpoint of a cluster as the best representatives of learned grouping. An example application includes image compression.
Clustering algorithms differ with respect to their strategy of identifying groupings: - Combinatorial algorithms select the most coherent of different groupings of observations - Probabilistic modeling estimates distributions that most likely generated the clusters - Hierarchical clustering finds a sequence of nested clusters that optimizes coherence at any given stage
Algorithms also differ by the notion of what constitutes a useful collection of objects that needs to match the data characteristics, domain and the goal of the applications. Types of groupings include: - Clearly separated groups of various shapes - Prototype- or center-based, compact clusters - Density-based clusters of arbitrary shape - Connectivity- or graph-based clusters
Important additional aspects of a clustering algorithm include whether - it requires exclusive cluster membership, - makes hard, i.e., binary, or soft, probabilistic assignment, and - is complete and assigns all data points to clusters.
The notebook clustering_algos compares the clustering results for several algorithm using toy dataset designed to test clustering algorithms.
k-Means is the most well-known clustering algorithm and was first proposed by Stuart Lloyd at Bell Labs in 1957.
The algorithm finds K centroids and assigns each data point to exactly one cluster with the goal of minimizing the within-cluster variance (called inertia). It typically uses Euclidean distance but other metrics can also be used. k-Means assumes that clusters are spherical and of equal size and ignores the covariance among features.
Cluster quality metrics help select among alternative clustering results.
Hierarchical clustering avoids the need to specify a target number of clusters because it assumes that data can successively be merged into increasingly dissimilar clusters. It does not pursue a global objective but decides incrementally how to produce a sequence of nested clusters that range from a single cluster to clusters consisting of the individual data points.
While hierarchical clustering does not have hyperparameters like k-Means, the measure of dissimilarity between clusters (as opposed to individual data points) has an important impact on the clustering result. The options differ as follows:
The notebook hierarchical_clustering demonstrates how this algorithm works, and how to visualize and evaluate the results.
Density-based clustering algorithms assign cluster membership based on proximity to other cluster members. They pursue the goal of identifying dense regions of arbitrary shapes and sizes. They do not require the specification of a certain number of clusters but instead rely on parameters that define the size of a neighborhood and a density threshold.
The notebook density_based_clustering demonstrates how DBSCAN and hierarchical DBSCAN work.
Gaussian mixture models (GMM) are a generative model that assumes the data has been generated by a mix of various multivariate normal distributions. The algorithm aims to estimate the mean & covariance matrices of these distributions.
It generalizes the k-Means algorithm: it adds covariance among features so that clusters can be ellipsoids rather than spheres, while the centroids are represented by the means of each distribution. The GMM algorithm performs soft assignments because each point has a probability to be a member of any cluster.
The notebook gaussian_mixture_models demonstrates the application of a GMM clustering model.
The key idea of hierarchical risk parity (HRP) is to use hierarchical clustering on the covariance matrix to be able to group assets with similar correlations together and reduce the number of degrees of freedom by only considering 'similar' assets as substitutes when constructing the portfolio.