Vector Spaces, Part 1

Vector Spaces: Part I, "Graphical Representations and Intuition"

Note: this post is a slightly modified version of the IPython notebook I originally created for my team's weekly teaching+learning sessions. If you want, you can see the original version of this notebook, or check out all the rest of our content.

Much of statistical learning relies on the ability to represent data observations (measurements) in a space comprising relevant dimensions (features). Often, the number of relevant dimensions is quite small: if you were trying to discern a model that described the area of a rectangle, observations of only two features (length and width) would be all you needed. Fisher's well-known iris dataset comprises 150 measurements of only four features 📊

In some cases - particularly with text analysis - the dimensionality of the space can grow much larger. In many approaches to text analysis, the process to get from a text corpus to numerical feature vectors involves a few steps. Just as an example, one way to do this is to:

  1. break the corpus into documents, e.g. each document on a new line of an input file
  2. parse each document into tokens, e.g. by splitting words on whitespace
  3. construct a feature vector for each document

One way to accomplish the final step is to consider each token (i.e. word) as a unique dimension, and the count of each word per document as the magnitude along the corresponding dimension. There are certainly other ways to define each of these steps (and more subtle details to consider within each), but for now, we'll stick with this simple one.
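The three steps above can be sketched with scikit-learn's CountVectorizer; the two-document corpus here is made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# a tiny made-up corpus: each string is one "document"
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# CountVectorizer tokenizes each document and counts token occurrences
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

# each unique token becomes one dimension of the vector space...
print(sorted(vectorizer.vocabulary_))
# ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']

# ...and each row is one document's count vector along those dimensions
print(counts.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]
```

With only seven unique tokens, the space is still small; the point is that every new word in the corpus adds another dimension.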

Using exactly this approach, constructing a vector space from just a few minutes of Tweets (each Tweet considered a document) leads to a space with hundreds of thousands of features! In this high-dimensional vector space, it becomes easy for us to be misled by our intuition for statistical learning approaches in more "human" dimensions e.g. one-, two- and three-dimensional spaces. At this point, many people will cite the "curse of dimensionality."

There are multiple phenomena referred to by this name in domains such as numerical analysis, sampling, combinatorics, machine learning, data mining, and databases. The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance: in order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high-dimensional data, however, all objects appear sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.

This "curse of dimensionality" refers to a few related, but distinct challenges with data for statistical learning in high dimensions:

  • "increasing dimensions decrease the power of a test statistic" ref
  • "Intuition fails us in high dimensions" ref
  • sparse data is increasingly found in the corners/shell of high-dimensional space (this notebook!)

I wanted to build more intuition around thinking about, visualizing, and generally being more aware of how these phenomena affect our typical kinds of analyses; this notebook is a first step, primarily focused on building an intuition for inspecting (and thinking about ways to inspect) spaces when we can no longer just plot them.

Along the way, I learned a number of new things, and aim to explore them in follow-up pieces.

Note: Beware that there are a lot of reused variable names in this notebook. If you get an unexpected result, or an error, be sure to check that the appropriate data generation step was run!

Take-aways

These are the two high-level objectives we'll aim for:

  • concepts for inspecting data in high dimensions
  • illustrations of how high dimensionality squeezes data "density profile" to edges
In [1]:
import copy
try:
    import ujson as json
except ImportError:
    import json    
import math
import operator
import random

from mpl_toolkits.mplot3d import Axes3D
import numpy as np
from numpy.linalg import norm as np_norm
import matplotlib.pyplot as plt
import pandas as pd
from scipy.spatial import distance as spd
import seaborn as sns
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

sns.set_style('whitegrid')
%matplotlib inline

Simple, visualizable spaces

We'll start by exploring some approaches to thinking about and inspecting data in spaces that we can comprehend without much effort.

Normal distributions in 2D

In [2]:
# number of data points
n = 1000

# array of continuous values, randomly drawn from standard normal in two dimensions
X = np.random.normal(size=(n,2))

# seaborn plays really nicely with pandas
df = pd.DataFrame(X, columns=['x0','x1'])

df.tail()
Out[2]:
x0 x1
995 -0.160302 1.037684
996 -0.559322 -0.942932
997 -0.978456 0.062490
998 -0.792962 -0.154146
999 -1.195318 1.513889

We have a 2-dimensional feature space containing 1000 data points. The coordinate axes are orthogonal, and we can equivalently think of each data point as a vector from the origin [ (0,0) in 2-dimensional space ] to the point defined by [x0, x1].

Since we only have two dimensions, we can look at the bivariate distribution quite easily using jointplot. Seaborn also gives us some handy tools for looking at the univariate distributions at the same time 🙌🏼

In [3]:
sns.jointplot(data=df, x='x0', y='x1', alpha=0.5)
Out[3]:
<seaborn.axisgrid.JointGrid at 0x10e092080>

Another distribution that can provide some hints about the structure of data in a multi-dimensional vector space is the pairwise inter-point distance distribution for all points in the data. Here's a function that makes plotting it a little cleaner.

In [4]:
def in_sample_dist(X, n):
    """Create a histogram of pairwise distances in array X, using n bins."""
    plt.figure(figsize=(15,6))

    # use scipy's pairwise distance function for efficiency
    plt.hist(spd.pdist(X), bins=n, alpha=0.6)

    plt.xlabel('inter-sample distance')
    plt.ylabel('count')    
In [5]:
in_sample_dist(X,n)

In unsupervised statistical learning, we're often interested in the existence of "clusters" in data. Our intuition in low dimensions can be helpful here. In order to identify and label a grouping of points as being unique from some other grouping of points, there needs to be a similarity or "sameness" metric that we can compare. One such measure is simply the distance between all of the points. If a group of points are all qualitatively closer to each other than another group of points, then we might call those two groups unique clusters.

If we look at the distribution of inter-point distances above, we see a relatively smooth distribution, suggesting that no group of points is notably closer to or farther from any other group of points. We'll come back to this idea shortly. (The inspiration for this approach is found here: pdf)

Above, the bivariate jointplot works great for displaying our data when it's in two dimensions, but you can probably imagine that even in just d=3 dimensions, looking at this distribution of data will be really hard. So, I want to create a metric that gives us a feel for where the data is located in the vector space. There are many ways to do this; for now, I'm going to consider the euclidean distance cumulative distribution function*. Remember that the euclidean distance is the $L_{2}$ norm $dist(p,q) = \sqrt{ \sum_{i=1}^{d} (q_{i}-p_{i})^{2} }$ where d is the dimensionality of the space. (Wiki)

*in fact, even in the course of developing this notebook, I learned that this is not a terribly great choice. But, hey, you have to start somewhere! ¯\_(ツ)_/¯
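As a quick sanity check on the formula, the $L_{2}$ norm can be computed three equivalent ways: numpy's norm (imported above as np_norm), scipy's euclidean distance, and the sum-of-squares formula directly. The 3-4-5 point here is made up for illustration:

```python
import math

import numpy as np
from numpy.linalg import norm as np_norm
from scipy.spatial import distance as spd

# a made-up 2-d point; its distance from the origin is 5 (a 3-4-5 triangle)
p = np.array([3.0, 4.0])
origin = np.zeros(len(p))

r_numpy = np_norm(p)                 # numpy's L2 norm
r_scipy = spd.euclidean(origin, p)   # scipy's euclidean distance
r_manual = math.sqrt(sum((q_i - p_i) ** 2 for p_i, q_i in zip(p, origin)))  # the formula

print(r_numpy, r_scipy, r_manual)  # all three agree: 5.0
```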

In [6]:
def radius(vector):
    """Calculate the euclidean norm for the given coordinate vector."""    
    origin = np.zeros(len(vector))
    # use scipy's distance functions again! 
    return spd.euclidean(origin, vector)
In [7]:
# use our function to create a new 'r' column in the dataframe
df['r'] = df.apply(radius, axis=1)

df.head()
Out[7]:
x0 x1 r
0 -0.176994 -0.359991 0.401149
1 -0.556111 -0.652158 0.857070
2 0.792821 0.303477 0.848919
3 -0.404705 -0.335644 0.525778
4 0.251837 -0.479817 0.541892

There are a couple of ways that I want to visualize this radial distance. First, I'd like to see the univariate distribution (from 0 to max(r)), and second, I'd like to see how much of the data is at a radius less than or equal to a particular value of r. To do this, I'll define a plotting function that takes a dataframe as shown above, and returns plots of these two distributions as described.

There's a lot of plotting hijinks in this function, so first just look at the output and see if it makes some sense. Then we can come back and dig through the plotting function.

In [8]:
def kde_cdf_plot(df, norm=False, vol=False):
    """Display stacked KDE and CDF plots."""
    
    assert 'r' in df, 'This method only works for dataframes that include a radial distance in an "r" column!'
    
    if norm:
        # overwrite df.r with normalized version
        df['r'] = df['r'] / max(df['r'])
        
    fig, (ax1, ax2) = plt.subplots(2,1, 
                                   sharex=True, 
                                   figsize=(15,8)
                                  )
    # subplot 1
    sns.distplot(df['r'], 
                 hist=False, 
                 rug=True, 
                 ax=ax1
                )
    ax1.set_ylabel('KDE')
    ax1.set_title('n={} in {}-d space'.format(len(df), df.shape[1] - 1) )

    # subplot 2
    if vol:
        raise NotImplementedError("Didn't finish implementing this volume normalization!")
        dim = df.shape[1] - 1
        df['r'].apply(lambda x: x**dim).plot(kind='hist', 
                                               cumulative=True, 
                                               normed=1, 
                                               bins=len(df['r']), 
                                               histtype='step', 
                                               linewidth=2,
                                               ax=ax2
                                              )

        ax2.set_ylabel('CDF')
        plt.xlim(0, .99*max(df['r'])**dim)        
        xlab = 'volume fraction'        
    else:
        df['r'].plot(kind='hist', 
                       cumulative=True, 
                       normed=1, 
                       bins=len(df['r']), 
                       histtype='step', 
                       linewidth=2,
                       ax=ax2
                      )

        ax2.set_ylabel('CDF')
        plt.xlim(0, .99*max(df['r']))
        
        xlab = 'radial distance'
    if norm:
        xlab += ' (%)'
    plt.xlabel(xlab)

Now, let's see these distributions for the 2-dimensional array we created earlier.

In [9]:
kde_cdf_plot(df)

As a reminder:

  • the kernel density estimate (KDE) is a nice visualization of the "density profile", created by placing a kernel (here, a Gaussian) at each data point, summing all of these curves, and then normalizing the total under-curve area to 1. The seaborn docs have a nice illustration of this technique. The ticks along the bottom are called a "rug plot", and are the observed values of the data (values of r)
  • the cumulative distribution function (CDF) is a measure of the fraction of values less than or equal to the specified value, $CDF_{X}(x)=P(X \le x)$. For the purpose of this session, I want to use this particular plot to highlight where the observed data is, relative to the "radius" of the entire space. The value of the CDF is the fraction of data contained at an equal or lesser "radius" value (in d dimensions).
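A tiny numeric example of the empirical CDF (the five radial distances here are made up):

```python
import numpy as np

# made-up radial distances for five points
r = np.array([0.2, 0.5, 0.9, 1.1, 1.4])

def ecdf(samples, x):
    """Empirical CDF: the fraction of samples with value <= x."""
    return np.mean(samples <= x)

print(ecdf(r, 0.9))  # 3 of the 5 values are <= 0.9, so 0.6
print(ecdf(r, 2.0))  # all values are <= 2.0, so 1.0
```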

Blobs in 2D

Let's add a bit of complexity to the examples above by making the data slightly more irregular: we'll use sklearn's blob constructor.

In [10]:
# data points, dimensions, blob count
n = 1000
dims = 2
blobs = 5

# note: default bounding space is +/- 10.0 in each dimension
X, y = make_blobs(n_samples=n, n_features=dims, centers=blobs)
In [11]:
# convert np arrays to a df, auto-label the columns
X_df = pd.DataFrame(X, columns=['x{}'.format(i) for i in range(X.shape[1])])

X_df.head()
Out[11]:
x0 x1
0 6.236402 5.908866
1 6.804768 7.235599
2 3.502532 6.679645
3 8.549006 6.194495
4 -5.272677 -8.104100
In [12]:
sns.jointplot(data=X_df, x='x0', y='x1')
Out[12]:
<seaborn.axisgrid.JointGrid at 0x113e005f8>
In [13]:
X_df['r'] = X_df.apply(radius, axis=1)

#X_df.head()

This time, we'll incorporate one extra kwarg in the kde_cdf_plot function: norm=True displays the x axis (radial distance) as a fraction of its maximum value. This will be helpful when we're comparing spaces of varying radial magnitude.

In [14]:
kde_cdf_plot(X_df, norm=True)

As a start, notice that the radius CDF for this data has shifted to the right. At larger r, we're closer to the "edge" of the space containing our data. The graph will vary with iterations of the data generation, but should consistently be shifted to the right relative to the 0-centered standard normal distribution.

Now let's look at the inter-sample distance distribution. Remember that this data is explicitly generated by a mechanism that includes clusters, so we should not expect the smooth, featureless distribution we saw earlier.

In [15]:
in_sample_dist(X,n)

Sure enough, we can see that there are in fact some peaks in the inter-sample distance distribution. This makes sense, because the data generation process encoded that exact idea. Since we're intentionally using a data generation process that builds in clusters, we'll always see a peak on the low end of the x axis... each cluster is created with a low (and similar) intra-cluster distance. The other, larger peaks illustrate the relationships between the clusters.

We may not see precisely the same number of peaks as were specified in the blob creation, though, because sometimes the blobs will sit on top of each other and "look" like one cluster. Compare the peaks of this distribution with the jointplot we created from the same data.
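make_blobs also returns the cluster labels, which we haven't used yet. Here's a sketch (not from the original notebook) that makes the intra- vs inter-cluster distinction quantitative by comparing mean distances within a labeled cluster against mean distances across clusters:

```python
import numpy as np
from scipy.spatial import distance as spd
from sklearn.datasets import make_blobs

# a fresh set of labeled blobs (fixed seed so the comparison is reproducible)
Xb, yb = make_blobs(n_samples=500, n_features=2, centers=3, random_state=0)

# mean pairwise distance among points sharing a label (intra-cluster)
intra = np.mean([spd.pdist(Xb[yb == label]).mean() for label in np.unique(yb)])

# mean distance between points with different labels (inter-cluster),
# read off the full square distance matrix
full = spd.squareform(spd.pdist(Xb))
mask = yb[:, None] != yb[None, :]
inter = full[mask].mean()

print(intra, inter)  # intra-cluster distances are much smaller
```

The low-distance peak in the histogram corresponds to the intra-cluster pairs; the larger peaks are the inter-cluster pairs.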

Blobs in 3D

Let's increase the dimension count by one, to 3, just about the limit of our intuition's abilities. To make the data generation process a bit more reusable, we'll use a function to get the data array and corresponding dataframes.

In [16]:
def make_blob_df(n_points=1000, dims=2, blobs=5, bounding_box=(-10.0, 10.0)):
    """Function to automate the np.array blob => pd.df creation and r calculation."""
    # nb: default bounding space is +/- 10.0 in each dimension
    X, y = make_blobs(n_samples=n_points, n_features=dims, centers=blobs, center_box=bounding_box)

    # make a df, auto-label the columns
    X_df = pd.DataFrame(X, columns=['x{}'.format(i) for i in range(X.shape[1])])
    X_df_no_r = copy.deepcopy(X_df)
    
    # add a radial distance column
    X_df['r'] = X_df.apply(radius, axis=1)

    return X, X_df, X_df_no_r, y
In [17]:
n = 1000
dims = 3
blobs = 5


X, X_df, X_df_no_r, y = make_blob_df(n, dims, blobs)

X_df.head()
#X_df_no_r.head()
Out[17]:
x0 x1 x2 r
0 -4.772849 -1.000716 9.158024 10.375496
1 -5.404354 -2.755771 9.526040 11.293660
2 3.012065 -3.795386 2.143007 5.298110
3 -3.757271 3.403764 2.968760 5.875051
4 6.073176 2.040680 8.669477 10.779966
In [18]:
fig = plt.figure(figsize=(12,7))
ax = fig.add_subplot(111, projection='3d')

ax.plot(X_df['x0'],X_df['x1'],X_df['x2'],'o', alpha=0.3)

ax.set_xlabel('x0'); ax.set_ylabel('x1'); ax.set_zlabel('x2')
Out[18]:
<matplotlib.text.Text at 0x114722320>
In [19]:
sns.pairplot(X_df_no_r, plot_kws=dict(alpha=0.3), diag_kind='kde')
Out[19]:
<seaborn.axisgrid.PairGrid at 0x113846710>
In [20]:
kde_cdf_plot(X_df, norm=True)

Again, compare this CDF to the 2-d case above; note that the data sits closer to the "edge" of the space.
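This pull toward the edge is partly a geometric effect: in a d-dimensional ball, the fraction of volume within a fraction f of the radius is f^d, so as d grows, even a thin outer shell holds nearly all the volume. A quick sketch (the 90% radius cutoff here is just illustrative):

```python
# fraction of a d-dimensional ball's volume inside a fraction f of its radius
def inner_volume_fraction(f, d):
    return f ** d

for d in (2, 3, 10, 100):
    # how much volume lies inside 90% of the radius; the rest is in the shell
    print(d, inner_volume_fraction(0.9, d))
# at d=2, 81% of the volume is inside 90% of the radius;
# at d=100, less than 0.01% is
```

Uniformly distributed data inherits this profile, which is why the radial CDF keeps shifting right as dimensionality increases.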

In [21]:
in_sample_dist(X,n)

Higher-dimensional blobs

Ok, let's jump out of the space where we can easily visualize the data, and go to d=10. While we can still look at pairwise coordinate locations, we can't see the whole space at once anymore, so we'll rely on our other plots for intuition about the profile of the space.

In [22]:
n = 1000
dims = 10
blobs = 5


X, X_df, X_df_no_r, y = make_blob_df(n, dims, blobs)

X_df.head()
Out[22]:
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 r
0 -0.876305 8.495646 9.679137 -5.570551 4.772455 -8.706477 -6.697756 -9.768288 0.888418 -3.829635 21.259692
1 -5.481066 7.093798 -4.012029 7.700416 -3.293914 -4.942807 -4.935820 10.443387 5.458501 -5.500957 19.609193
2 6.070724 -6.880575 6.897200 -5.783782 1.647554 4.837057 7.372285 -9.067051 -8.047290 0.264811 19.817217
3 0.573545 7.859031 5.850043 -9.907207 -10.221014 -9.773433 -8.182964 -7.617599 -4.271119 8.249641 24.611907
4 6.680778 1.719337 6.045810 -9.378527 8.525425 -7.546026 -5.812690 -6.862207 1.784192 10.621219 22.329213
In [23]:
# this starts to take a few seconds when d~10
sns.pairplot(X_df_no_r, diag_kind='kde', plot_kws=dict(alpha=0.3))
Out[23]:
<seaborn.axisgrid.PairGrid at 0x1132e42e8>