Vector Spaces: Part I, "Graphical Representations and Intuition"

Note: this post is a slightly modified version of the IPython notebook I originally created for my team's weekly teaching+learning sessions. If you want, you can see the original version of this notebook, or check out all the rest of our content.

Much of statistical learning relies on the ability to represent data observations (measurements) in a space comprising relevant dimensions (features). Often, the number of relevant dimensions is quite small; if you were trying to discern a model that described the area of a rectangle, observations of only two features (length and width) would be all you needed. Fisher's well-known iris dataset comprises 150 measurements of only four features 📊

In some cases - particularly with text analysis - the dimensionality of the space can grow much faster. In many approaches to text analysis, the process of getting from a text corpus to numerical feature vectors involves a few steps. Just as an example, one way to do this is to:

  1. break the corpus into documents e.g. each on a new line of an input file
  2. parse the document into tokens e.g. split words on whitespace
  3. construct a feature vector for each document

One way to accomplish the final step is to consider each token (i.e. word) as a unique dimension, and the count of each word per document as the magnitude along the corresponding dimension. There are certainly other ways to define each of these steps (and more subtle details to consider within each), but for now, we'll stick with this simple one.
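To make those three steps concrete, here's a minimal sketch in plain Python (the two-document corpus and whitespace tokenization are just illustrative placeholders; later in this notebook we'll use scikit-learn's CountVectorizer to do the same thing):

# hypothetical two-document corpus, just to illustrate the steps above
corpus = ['the dog likes cats', 'the blue cat eats brown sharks']

# step 2: parse each document into tokens by splitting on whitespace
tokenized = [doc.lower().split() for doc in corpus]

# step 3: one dimension per unique token, word counts as magnitudes
vocabulary = sorted(set(token for doc in tokenized for token in doc))
vectors = [[doc.count(term) for term in vocabulary] for doc in tokenized]

print(vocabulary)  # ['blue', 'brown', 'cat', 'cats', 'dog', 'eats', 'likes', 'sharks', 'the']
print(vectors[0])  # [0, 0, 0, 1, 1, 0, 1, 0, 1]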

Using exactly this approach, constructing a vector space from just a few minutes of Tweets (each Tweet considered a document) leads to a space with hundreds of thousands of features! In this high-dimensional vector space, it becomes easy for us to be misled by our intuition for statistical learning approaches in more "human" dimensions e.g. one-, two- and three-dimensional spaces. At this point, many people will cite the "curse of dimensionality."

There are multiple phenomena referred to by this name in domains such as numerical analysis, sampling, combinatorics, machine learning, data mining, and databases. The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance: in order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. Organizing and searching data also often relies on detecting areas where objects form groups with similar properties; in high-dimensional data, however, all objects appear sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.

This "curse of dimensionality" refers to a few related, but distinct challenges with data for statistical learning in high dimensions:

  • "increasing dimensions decrease the power of a test statistic" ref
  • "Intuition fails us in high dimensions" ref
  • sparse data is increasingly found in the corners/shell of high-dimensional space (this notebook!)

I wanted to build more intuition around thinking about, visualizing, and generally being more aware of how these phenomena affect our typical kinds of analyses; this notebook is a first step, primarily focused on building an intuition for ways to inspect spaces when we can no longer just plot them.

Along the way, I learned a number of new things, and aim to explore them in follow-up pieces.

Note: Beware that there are a lot of reused variable names in this notebook. If you get an unexpected result, or an error, be sure to check that the appropriate data generation step was run!

Take-aways

These are the two high-level objectives we'll aim for:

  • concepts for inspecting data in high dimensions
  • illustrations of how high dimensionality squeezes the data's "density profile" toward the edges
In [1]:
import copy
try:
    import ujson as json
except ImportError:
    import json    
import math
import operator
import random

from mpl_toolkits.mplot3d import Axes3D
import numpy as np
from numpy.linalg import norm as np_norm
import matplotlib.pyplot as plt
import pandas as pd
from scipy.spatial import distance as spd
import seaborn as sns
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

sns.set_style('whitegrid')
%matplotlib inline

Simple, visualizable spaces

We'll start by exploring some approaches to thinking about and inspecting data in spaces that we can comprehend without much effort.

Normal distributions in 2D

In [2]:
# number of data points
n = 1000

# array of continuous values, randomly drawn from standard normal in two dimensions
X = np.array(np.random.normal(size=(n,2)))

# seaborn plays really nicely with pandas
df = pd.DataFrame(X, columns=['x0','x1'])

df.tail()
Out[2]:
x0 x1
995 -0.160302 1.037684
996 -0.559322 -0.942932
997 -0.978456 0.062490
998 -0.792962 -0.154146
999 -1.195318 1.513889

We have a 2-dimensional feature space containing 1000 data points. The coordinate axes are orthogonal, and we can equivalently think of each data point as being represented by a vector from the origin [ (0,0) in 2-dimensional space ] to the point defined by [x0, x1].

Since we only have two dimensions, we can look at the bivariate distribution quite easily using jointplot. Seaborn also gives us some handy tools for looking at the univariate distributions at the same time 🙌🏼

In [3]:
sns.jointplot(data=df, x='x0', y='x1', alpha=0.5)
Out[3]:
<seaborn.axisgrid.JointGrid at 0x10e092080>

Another distribution that can provide some hints about the structure of data in a multi-dimensional vector space is the pairwise inter-point distance distribution for all points in the data. Here's a function that makes this a little cleaner.

In [4]:
def in_sample_dist(X, n):
    """Create a histogram of pairwise distances in array X, using n bins."""
    plt.figure(figsize=(15,6))

    # use scipy's pairwise distance function for efficiency
    plt.hist(spd.pdist(X), bins=n, alpha=0.6)

    plt.xlabel('inter-sample distance')
    plt.ylabel('count')    
In [5]:
in_sample_dist(X,n)

In unsupervised statistical learning, we're often interested in the existence of "clusters" in data. Our intuition in low dimensions can be helpful here. In order to identify and label a grouping of points as being distinct from some other grouping of points, there needs to be a similarity or "sameness" metric that we can compare. One such measure is simply the distance between all of the points. If the points in one group are all qualitatively closer to each other than they are to another group of points, then we might call those two groups unique clusters.

If we look at the distribution of inter-point distances above, we see a relatively smooth distribution, suggesting that no group of points is notably closer to or farther from any other group of points. We'll come back to this idea shortly. (The inspiration for this approach is found here: pdf)

Above, the bivariate jointplot works great for displaying our data when it's in two dimensions, but you can probably imagine that even in just d=3 dimensions, looking at this distribution of data will be really hard. So, I want to create a metric that gives us a feel for where the data is located in the vector space. There are many ways to do this. For now, I'm going to consider the euclidean distance cumulative distribution function*. Remember that the euclidean distance is the $L_{2}$ norm $dist(p,q) = \sqrt{ \sum_{i=1}^{d} (q_{i}-p_{i})^{2} }$, where d is the dimensionality of the space. (Wiki)

*in fact, even in the course of developing this notebook, I learned that this is not a terribly great choice. But, hey, you have to start somewhere! ¯\_(ツ)_/¯
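As a quick sanity check before defining the helper below (a toy example, not part of the original analysis), scipy's euclidean distance from the origin and numpy's norm - which we imported as np_norm above - agree:

# a 3-4-5 right triangle: both calls should return 5.0
v = np.array([3.0, 4.0])
origin = np.zeros(len(v))

print(spd.euclidean(origin, v))  # 5.0
print(np_norm(v))                # 5.0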

In [6]:
def radius(vector):
    """Calculate the euclidean norm for the given coordinate vector."""    
    origin = np.zeros(len(vector))
    # use scipy's distance functions again! 
    return spd.euclidean(origin, vector)
In [7]:
# use our function to create a new 'r' column in the dataframe
df['r'] = df.apply(radius, axis=1)

df.head()
Out[7]:
x0 x1 r
0 -0.176994 -0.359991 0.401149
1 -0.556111 -0.652158 0.857070
2 0.792821 0.303477 0.848919
3 -0.404705 -0.335644 0.525778
4 0.251837 -0.479817 0.541892

There are a couple of ways that I want to visualize this radial distance. First, I'd like to see the univariate distribution (from 0 to max(r)), and second, I'd like to see how much of the data is at a radius less than or equal to a particular value of r. To do this, I'll define a plotting function that takes a dataframe as shown above, and returns plots of these two distributions as described.

There's a lot of plotting hijinks in this function, so first just look at the output and see if it makes some sense. Then we can come back and dig through the plotting function.

In [8]:
def kde_cdf_plot(df, norm=False, vol=False):
    """Display stacked KDE and CDF plots."""
    
    assert 'r' in df, 'This method only works for dataframes that include a radial distance in an "r" column!'
    
    if norm:
        # overwrite df.r with normalized version
        df['r'] = df['r'] / max(df['r'])
        
    fig, (ax1, ax2) = plt.subplots(2,1, 
                                   sharex=True, 
                                   figsize=(15,8)
                                  )
    # subplot 1
    sns.distplot(df['r'], 
                 hist=False, 
                 rug=True, 
                 ax=ax1
                )
    ax1.set_ylabel('KDE')
    ax1.set_title('n={} in {}-d space'.format(len(df), df.shape[1] - 1) )

    # subplot 2
    if vol:
        raise NotImplementedError("Didn't finish implementing this volume normalization!")
        dim = df.shape[1] - 1
        df['r'].apply(lambda x: x**dim).plot(kind='hist', 
                                               cumulative=True, 
                                               normed=1, 
                                               bins=len(df['r']), 
                                               histtype='step', 
                                               linewidth=2,
                                               ax=ax2
                                              )

        ax2.set_ylabel('CDF')
        plt.xlim(0, .99*max(df['r'])**dim)        
        xlab = 'volume fraction'        
    else:
        df['r'].plot(kind='hist', 
                       cumulative=True, 
                       normed=1, 
                       bins=len(df['r']), 
                       histtype='step', 
                       linewidth=2,
                       ax=ax2
                      )

        ax2.set_ylabel('CDF')
        plt.xlim(0, .99*max(df['r']))
        
        xlab = 'radial distance'
    if norm:
        xlab += ' (%)'
    plt.xlabel(xlab)

Now, let's see these distributions for the 2-dimensional array we created earlier.

In [9]:
kde_cdf_plot(df)
/Users/jmontague/.virtualenvs/py34-data/lib/python3.4/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j

As a reminder:

  • the kernel density estimate (KDE) is a nice visualization of the "density profile", created by placing a Gaussian kernel at each data point, summing all of these curves, and then normalizing the total area under the curve to 1. The seaborn docs have a nice illustration of this technique. The ticks on the bottom are called a "rug plot", and are the values of the data (values of r)
  • the cumulative distribution function (CDF) is a measure of the fraction of values which are equal to, or less than, the specified value, $CDF_{X}(x)=P(X \le x)$. For the purpose of this session, I want to use this particular measure to highlight where the observed data is, relative to the "radius" of the entire space. The value of the CDF is the fraction of data contained at an equal or lesser "radius" value (in d dimensions). (A small from-scratch sketch of both computations follows this list.)
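Here is that small from-scratch sketch of both quantities on a toy set of radial distances (the Gaussian bandwidth is hand-picked here; seaborn chooses it automatically in the plots above):

r_toy = np.array([0.4, 0.9, 1.1, 1.6])   # toy radial distances
grid = np.linspace(0, 2.5, 200)
bandwidth = 0.3

# KDE: a Gaussian kernel centered on each observation, summed, then normalized by (n * bandwidth)
kernels = np.exp(-0.5 * ((grid[:, None] - r_toy) / bandwidth)**2) / np.sqrt(2 * np.pi)
kde = kernels.sum(axis=1) / (len(r_toy) * bandwidth)

# empirical CDF: fraction of observations at or below each grid value
cdf = np.array([np.mean(r_toy <= g) for g in grid])

plt.plot(grid, kde, label='KDE')
plt.plot(grid, cdf, label='CDF')
plt.legend()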

Blobs in 2D

Let's add a bit of complexity to the examples above by making the data slightly more irregular: we'll use sklearn's blob constructor.

In [10]:
# data points, dimensions, blob count
n = 1000
dims = 2
blobs = 5

# note: default bounding space is +/- 10.0 in each dimension
X, y = make_blobs(n_samples=n, n_features=dims, centers=blobs)
In [11]:
# convert np arrays to a df, auto-label the columns
X_df = pd.DataFrame(X, columns=['x{}'.format(i) for i in range(X.shape[1])])

X_df.head()
Out[11]:
x0 x1
0 6.236402 5.908866
1 6.804768 7.235599
2 3.502532 6.679645
3 8.549006 6.194495
4 -5.272677 -8.104100
In [12]:
sns.jointplot(data=X_df, x='x0', y='x1')
Out[12]:
<seaborn.axisgrid.JointGrid at 0x113e005f8>
In [13]:
X_df['r'] = X_df.apply(radius, axis=1)

#X_df.head()

This time, we'll incorporate one extra kwarg in the kde_cdf_plot function: norm=True displays the x axis (radial distance) as a fraction of the maximum value. This will be helpful when we're comparing spaces of varying radial magnitude.

In [14]:
kde_cdf_plot(X_df, norm=True)

As a start, notice that the radius CDF for this data has shifted to the right. At larger r, we're closer to the "edge" of the space containing our data. The graph will vary with iterations of the data generation, but should consistently be shifted to the right relative to the 0-centered standard normal distribution.

Now let's look at the inter-sample distance distribution. Remember that this data is explicitly generated by a mechanism that includes clusters, so we should not expect the smooth, unimodal distribution we saw earlier.

In [15]:
in_sample_dist(X,n)

Sure enough, we can see that there are in fact some peaks in the inter-sample distance. This makes sense, because we know that the data generation process encoded that exact idea. Since we're intentionally using a data generation process that builds in clusters, we'll always see a peak on the low end of the x axis... each cluster is created with a low (and similar) intra-cluster distance. The other, larger peaks illustrate the relationships between the clusters.

We may not see precisely the same number of peaks as were specified in the blob creation, though, because we know that sometimes the blobs will sit on top of each other and "look" like one cluster. Compare the peaks of this distribution with the jointplot we created from the same data.

Blobs in 3D

Let's increase the dimension count by one, to 3, just about the limit of our intuition's abilities. To make the data generation process a bit more reusable, we'll use a function to get the data array and corresponding dataframes.

In [16]:
def make_blob_df(n_points=1000, dims=2, blobs=5, bounding_box=(-10.0, 10.0)):
    """Function to automate the np.array blob => pd.df creation and r calculation."""
    # nb: default bounding space is +/- 10.0 in each dimension
    X, y = make_blobs(n_samples=n_points, n_features=dims, centers=blobs, center_box=bounding_box)

    # make a df, auto-label the columns
    X_df = pd.DataFrame(X, columns=['x{}'.format(i) for i in range(X.shape[1])])
    X_df_no_r = copy.deepcopy(X_df)
    
    # add a radial distance column
    X_df['r'] = X_df.apply(radius, axis=1)

    return X, X_df, X_df_no_r, y
In [17]:
n = 1000
dims = 3
blobs = 5


X, X_df, X_df_no_r, y = make_blob_df(n, dims, blobs)

X_df.head()
#X_df_no_r.head()
Out[17]:
x0 x1 x2 r
0 -4.772849 -1.000716 9.158024 10.375496
1 -5.404354 -2.755771 9.526040 11.293660
2 3.012065 -3.795386 2.143007 5.298110
3 -3.757271 3.403764 2.968760 5.875051
4 6.073176 2.040680 8.669477 10.779966
In [18]:
fig = plt.figure(figsize=(12,7))
ax = fig.add_subplot(111, projection='3d')

ax.plot(X_df['x0'],X_df['x1'],X_df['x2'],'o', alpha=0.3)

ax.set_xlabel('x0'); ax.set_ylabel('x1'); ax.set_zlabel('x2')
Out[18]:
<matplotlib.text.Text at 0x114722320>
In [19]:
sns.pairplot(X_df_no_r, plot_kws=dict(alpha=0.3), diag_kind='kde')
Out[19]:
<seaborn.axisgrid.PairGrid at 0x113846710>
In [20]:
kde_cdf_plot(X_df, norm=True)

Again, compare this CDF to the 2-d case above; note that the data is closer to the "edge" of the space.

In [21]:
in_sample_dist(X,n)

Higher-dimensional blobs

Ok, let's jump out of the space where we can easily visualize the data. Let's now go to d=10. While we can still look at pairwise coordinate locations, we can't see the whole space at once anymore. Now we'll rely on our other plots for intuition of the space profile.

In [22]:
n = 1000
dims = 10
blobs = 5


X, X_df, X_df_no_r, y = make_blob_df(n, dims, blobs)

X_df.head()
Out[22]:
x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 r
0 -0.876305 8.495646 9.679137 -5.570551 4.772455 -8.706477 -6.697756 -9.768288 0.888418 -3.829635 21.259692
1 -5.481066 7.093798 -4.012029 7.700416 -3.293914 -4.942807 -4.935820 10.443387 5.458501 -5.500957 19.609193
2 6.070724 -6.880575 6.897200 -5.783782 1.647554 4.837057 7.372285 -9.067051 -8.047290 0.264811 19.817217
3 0.573545 7.859031 5.850043 -9.907207 -10.221014 -9.773433 -8.182964 -7.617599 -4.271119 8.249641 24.611907
4 6.680778 1.719337 6.045810 -9.378527 8.525425 -7.546026 -5.812690 -6.862207 1.784192 10.621219 22.329213
In [23]:
# this starts to take a few seconds when d~10
sns.pairplot(X_df_no_r, diag_kind='kde', plot_kws=dict(alpha=0.3))
Out[23]:
<seaborn.axisgrid.PairGrid at 0x1132e42e8>
In [24]:
kde_cdf_plot(X_df, norm=True)
In [25]:
in_sample_dist(X,n)

Having seen the way these plots vary individually, let's compare, side-by-side, a similar data generation process (same number of points and clusters) in a range of dimensions.

In [26]:
n_points = 1000
dim_range = [2, 100, 10000]
blob_count = 5


fig, (ax1, ax2) = plt.subplots(2,1, sharex=True, figsize=(15,8))

for d in dim_range:
    ## data generation    
    # random gaussian blobs in d-dims
    X, y = make_blobs(n_samples=n_points, n_features=d, centers=blob_count)
    ## 
    
    ## calculation
    # create a labeled df from X
    X_df = pd.DataFrame(X, columns=['x{}'.format(i) for i in range(X.shape[1])])
    # add an 'r' column
    #X_df_no_r = copy.deepcopy(X_df)
    X_df['r'] = X_df.apply(radius, axis=1)
    # normalize r value to % of max?
    X_df['r'] = X_df['r'] / max(X_df['r'])
    ##
    
    ## plotting
    # subplot 1 - KDE
    sns.distplot(X_df['r'], 
                 kde=True,
                 hist=False, 
                 rug=True, 
                 ax=ax1,
                 label='{}-dims'.format(d)
                )
    
    # subplot 2 - CDF
    X_df['r'].plot(kind='hist', 
                   cumulative=True, 
                   normed=1, 
                   bins=len(X_df['r']), 
                   histtype='step', 
                   linewidth=2,
                   ax=ax2
                  )
    ##
    

ax1.set_ylabel('KDE')
ax1.set_title('n={} in {}-d space'.format(len(X_df), dim_range) )
ax2.set_ylabel('CDF')

plt.xlim(0, .999*max(X_df['r']))    
plt.xlabel('radial distance (%)')
Out[26]:
<matplotlib.text.Text at 0x11a54efd0>
In [27]:
fig, (ax1, ax2, ax3) = plt.subplots(3,1, figsize=(15,9))


for i,d in enumerate(dim_range):
    X, y = make_blobs(n_samples=n_points, n_features=d, centers=blob_count)
    
    # loop through the subplots
    plt.subplot('31{}'.format(i+1))
    # plot the data 
    plt.hist(spd.pdist(X), bins=n_points, alpha=0.6)
    plt.ylabel('count (d={})'.format(d))

ax3.set_xlabel('inter-sample distance')  
Out[27]:
<matplotlib.text.Text at 0x12a9b3240>

The highest-dimensional of these inter-point histograms (the bottom panel) does an interesting job of illustrating the sort of "us versus them" nature of data at the edges of a high-dimensional space:

In this diagram (plus or minus the variations introduced by the random blob creation), you can see there is one giant pile of data at low distance ("us", that is, the intra-cluster distances as viewed from within any one cluster), and then approximately one giant pile of data at large distance ("them", that is, everything that's not in your cluster looks equidistantly far away). While I have a suspicion that this can pose challenges for some algorithms, I need to do more homework before making any claims.
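One rough way to quantify this effect (a quick sketch, not part of the original analysis, and using a single standard-normal cloud rather than blobs) is to watch the spread of pairwise distances shrink relative to their mean as the dimension grows:

# relative spread of pairwise distances for 200 standard-normal points in increasing dimensions
for d in [2, 10, 100, 10000]:
    pts = np.random.normal(size=(200, d))
    dists = spd.pdist(pts)
    print('d={:>6}: (max - min) / mean = {:.3f}'.format(d, (dists.max() - dists.min()) / dists.mean()))
# as d grows this ratio shrinks: every point starts to look roughly equidistant from every other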

Text data

Most of the time, our unsupervised clustering in high dimensions is a result of using text data as an input. We'll start with a small corpus - again, to build intuition about what the data looks like - and then work up.

In [28]:
small_corpus = [
    'The dog likes cats.',
    'The blue cat eats brown sharks.',
    'Why not, blue?'
]
In [29]:
vec = CountVectorizer()

X = vec.fit_transform(small_corpus)

X.todense()
Out[29]:
matrix([[0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0],
        [1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0],
        [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1]], dtype=int64)
In [30]:
vec.vocabulary_
Out[30]:
{'blue': 0,
 'brown': 1,
 'cat': 2,
 'cats': 3,
 'dog': 4,
 'eats': 5,
 'likes': 6,
 'not': 7,
 'sharks': 8,
 'the': 9,
 'why': 10}

It's good to remember how to map the matrix-like data onto the words that go into it...

In [31]:
terms = [x for x,_ in sorted(vec.vocabulary_.items(), key=operator.itemgetter(1))]

text_df = pd.DataFrame(X.todense(), columns=terms)

text_df
Out[31]:
blue brown cat cats dog eats likes not sharks the why
0 0 0 0 1 1 0 1 0 0 1 0
1 1 1 1 0 0 1 0 0 1 1 0
2 1 0 0 0 0 0 0 1 0 0 1
In [32]:
text_df['r'] = text_df.apply(radius, axis=1)

text_df
Out[32]:
blue brown cat cats dog eats likes not sharks the why r
0 0 0 0 1 1 0 1 0 0 1 0 2.000000
1 1 1 1 0 0 1 0 0 1 1 0 2.449490
2 1 0 0 0 0 0 0 1 0 0 1 1.732051
In [33]:
kde_cdf_plot(text_df, norm=True)

With a tiny little corpus, these plots aren't very useful. Let's use a bigger one: this text file (not included in the repo, sorry visitors!) is about a 10-minute, 10% sample of Tweet body text from the Firehose. It has a little under 400,000 Tweets.

In [34]:
text_array = []

with open('twitter_2016-04-06_2030.jsonl.body.txt', 'r') as infile:
    for line in infile:
        text_array.append(line.replace('\n', ' '))
        
print( len(text_array) )
print( text_array[0] )        
374941
@iamkuds omgg THANK YOU SO MUCH❤️❤️ 
In [35]:
vec = CountVectorizer(
                    #binary=1,
                    ## add dimensionality reduction?
                    #stop_words='english',
                    #lowercase=True,
                    #min_df=10
                    )

dtm = vec.fit_transform(text_array)

dtm
Out[35]:
<374941x523498 sparse matrix of type '<class 'numpy.int64'>'
	with 3051924 stored elements in Compressed Sparse Row format>
In [36]:
# what fraction of the feature space is full?
3051924 / ( 374941*523498 )
Out[36]:
1.5548759791171626e-05
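Equivalently (and without hard-coding the numbers), the same fraction can be read off the sparse matrix itself:

# stored (non-zero) entries divided by the total number of cells
dtm.nnz / (dtm.shape[0] * dtm.shape[1])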

We have to do the radius math slightly differently now, because we're dealing with a scipy CSR matrix instead of a dense numpy array.

In [37]:
# element-wise square, sum along each row, flatten the result to a 1-d array (.A1), then take the square root
dtm_r = dtm.multiply(dtm).sum(axis=1).A1**0.5


#print(len(dtm_r))
#print(dtm_r)
#print(min(dtm_r), np.median(dtm_r), max(dtm_r))
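As a quick cross-check (illustrative only), we can densify a few rows and compare against numpy's row-wise norm:

# the sparse computation should match a dense L2 norm on the first few rows
print(dtm_r[:5])
print(np_norm(dtm[:5].toarray(), axis=1))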
In [38]:
s = pd.Series(dtm_r)

plt.figure(figsize=(15,6))
s.plot(kind='hist', 
       cumulative=True, 
       normed=1, 
       bins=len(dtm_r), 
       histtype='step', 
       linewidth=2
      )

plt.ylabel('CDF')
#plt.xlim(0, .99*max(dtm_r))
plt.xlim(0, 6)
plt.xlabel('radial distance')
Out[38]:
<matplotlib.text.Text at 0x140bb7ef0>
In [39]:
# This is a super interesting side note: some tweets can totally throw off your distribution. 
# This one Tweet had 114 repetitions of a single character. If you swap the xlim() commands 
#  above, you'll see that r extends to over 100. This is why:
#text_array[ s[ s > 114 ].index[0] ]

<record-stopping screeching noise>

Ok, so I spent some time working with this data, and I'll be honest: I expected this distribution to be much more skewed to large r! In fact, I thought it would be more exaggerated than the blob examples above.

Since I didn't have enough time to dig any deeper for this session, let's keep this observation in the back of our minds, and come back to it in another session.

We can round out today's discussion with one more relevant topic...

Dimensionality reduction

Before we end this session, we'll consider one more facet of high-dimensional spaces: reducing them to lower dimension. For now, we'll illustrate the effect of principal component analysis using the same inspection techniques we've been using all along.

If we try to densify the 500k+ dimension document term matrix above, we'll run out of RAM. So, let's use a synthetic data set.
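To put a rough number on that claim, a dense version of that matrix, at 8 bytes per int64 entry, would need roughly:

# ~374,941 rows x ~523,498 columns x 8 bytes per entry, in terabytes
374941 * 523498 * 8 / 1e12   # ~1.6 TB

which is far more memory than any laptop has.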

First, we look at our metrics in 10,000 dimensions, then again after using PCA to bring the data down to 3.

In [40]:
n = 2000
dims = 10000
blobs = 10


X, X_df, X_df_no_r, y = make_blob_df(n, dims, blobs)

#X_df_no_r.head()
In [41]:
kde_cdf_plot(X_df, norm=True)
plt.xlim(0,1)
Out[41]:
(0, 1)
In [42]:
in_sample_dist(X,n)

Note how extreme the "us versus them" scenario appears in that histogram!

Now, we know that the data generation process built in the notion of identifiable clusters. Let's see if we can surface that information by projecting our high-dimensional data down into a smaller number of dimensions using principal component analysis.

In [43]:
# now apply PCA and reduce the dimension down to 3
pca = PCA(n_components=3)

X_df_3d = pd.DataFrame(pca.fit_transform(X_df_no_r), columns=['x0','x1','x2'])
In [44]:
# add in that radial distance column
X_df_3d['r'] = X_df_3d.apply(radius, axis=1)

X_df_3d.head()
Out[44]:
x0 x1 x2 r
0 -215.250778 -138.558699 193.697431 321.014182
1 -110.868005 280.171376 211.281449 368.004844
2 -276.175018 -334.101613 -73.344600 439.631617
3 -109.892903 278.921007 210.614312 366.376537
4 164.134443 -96.761056 -92.247380 211.689387
In [45]:
# add in the labels so we can color by them
X_df_3d['y'] = y
In [46]:
# nb: using the vars kwarg seems to remove the ability to include KDE
sns.pairplot(X_df_3d, 
             vars=['x0','x1','x2'], 
             hue='y', 
             palette="husl",
             diag_kind='kde',  
             plot_kws=dict(s=50, alpha=0.7)
            )
Out[46]:
<seaborn.axisgrid.PairGrid at 0x11cc93d30>
In [47]:
kde_cdf_plot(X_df_3d, norm=True)
In [48]:
#in_sample_dist(X_df_3d[['x0','x1','x2']],n)

Given the two plots just above, it seems like we've both done a good job of representing the underlying clusters in our lower-dimensional space, and moved the data away from the extreme edges of the feature space. We should expect both that our algorithms can run more efficiently (faster), and that they can achieve a higher level of statistical significance.

Still to come...

In future installments, I look forward to:

  • a deeper understanding of our typical high-dimensional text feature space
  • strategies for dealing with distance calculations in high dimensions
  • ... probably some other stuff, too...