Practical NLP for Healthcare: Patient Segmentation For Better Care

Mai
Jun 28, 2021

In healthcare, data is massive, highly dimensional, seasonal, and as mysterious as the depths of the Pacific Ocean! Just as a well-built submarine and a team of trained divers help us navigate and understand the diverse array of life forms deep in the big blue sea, advances in machine learning enable us to glean unique insights from ever-dynamic, ever-evolving healthcare data. Clustering patients into similar groups to provide targeted, customized, and better care is as crucial as ever. At the same time, unstructured text data, such as doctors’ notes, medical literature, and patient profiles, is becoming more prevalent, making Natural Language Processing (NLP) tools especially useful and powerful. In this tutorial, we will take a deeper dive into building a starter NLP pipeline that segments similar patients using Gaussian Mixture Model (GMM) clustering. We will also explore hierarchical clustering and visualize the relationships between the clusters with a dendrogram.


Why Clustering?

Early in our data science projects, we often rely on patterns discovered through our own exploratory data analysis and on domain knowledge from subject matter experts to quickly segment patients into different groups for actionable insights. However, these rule-based methods can lead to tunnel vision and biased analyses. Unsupervised machine learning methods, such as clustering, enable us to reduce our blind spots, uncover previously unknown patterns, and produce the following advantages:

1. Early Interventions For At-Risk Groups

During clustering, we could efficiently discover previously unidentified, at-risk groups of patients, for example by constructing a heat map of clusters based on the values of patients’ attributes. We could then proactively apply specific, targeted interventions for at-risk groups to improve care.

2. Equitable Healthcare For Under-Served Groups

Building on the point above, clustering could help us uncover specific groups of patients that may not be receiving the resources and care they need. Identifying these communities early in our data science projects would help stakeholders better plan for and roll out more resources for them, further removing barriers to healthcare.

3. A Targeted Model-Building Strategy

If we build a machine learning model to predict patients’ behaviors, either adding cluster assignments as features or building a separate model for each cluster is likely to produce results that better reflect the attributes and needs unique to each community or cluster.

K-Means Clustering vs. GMM Clustering

In this tutorial, we will use GMM clustering instead of the widely used K-Means. K-Means divides data into a fixed number k of clusters. A centroid is an imaginary or real data point at a cluster’s center. Each data point is assigned to the closest centroid, and a cluster is the group of data points closest to that centroid. K-Means takes the following steps when constructing clusters:

1. Randomly initializing centroids and their respective clusters;

2. Assigning each data point to the closest centroid and its respective cluster based on the Euclidean distance between the centroid and the data point;

3. Calculating the mean of all data points in each cluster and moving its centroid to that average; and

4. Repeating steps 2 and 3 until all centroids are stable, that is, until data points are no longer reassigned to different clusters.

On one hand, K-Means is fast and easy to implement. On the other hand, random centroid initialization can yield different centroids and therefore different clusterings, making cluster assignments uncertain. K-Means is thus very sensitive to the initial centroid and cluster assignments and can produce many different solutions, especially when clusters are heterogeneous or not spherically shaped. Furthermore, hard assignment of data points to specific clusters becomes problematic when there are distinct but smaller groups of data points that differ greatly from the rest. Even though the data points in these groups are extremely far from any centroid compared to other points, assigning them to a cluster simply because its centroid happens to be the closest makes the cluster misleading: the cluster is then not representative of these distinct but smaller groups.

Mixture models address this problem via “soft” assignment of data points to clusters based on probabilities. Because K-Means is fast and simple to implement, it is often used for pre-clustering, dividing data points into sub-spaces where more sophisticated clustering algorithms can be applied. K-Means’s approach is known as hard clustering: each data point is assigned to exactly one cluster. Unlike K-Means, GMM clustering takes a probabilistic approach to cluster assignments. This flexible approach, known as soft clustering, assigns a data point to a cluster with a probabilistic score that indicates the strength of its association with that cluster. And unlike K-Means, which struggles with heterogeneously shaped clusters, GMM is flexible enough to accommodate differently shaped clusters and potential correlations among the data points within them. Even though GMM’s run-time can be slower than K-Means due to the probability calculations, its flexibility helps us capture the seasonality and diversity present in most real-world data, such as healthcare data.
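To make the hard-versus-soft distinction concrete, here is a small sketch on synthetic data. This is illustrative only and not part of the tutorial’s pipeline; the blob parameters are arbitrary:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Toy data: three blobs with very different spreads
X, _ = make_blobs(n_samples=300, centers=3,
                  cluster_std=[0.5, 1.0, 2.5], random_state=42)

# Hard clustering: each point gets exactly one label
hard_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Soft clustering: each point gets a probability per cluster
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
soft_probs = gmm.predict_proba(X)
print(soft_probs[0].round(3))  # one point's membership strengths, summing to 1
```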

Building a Simple NLP Pipeline

Let’s start by building a simple NLP pipeline that segments patients into similar groups using Term Frequency-Inverse Document Frequency (TF-IDF) and cosine similarity. For this tutorial’s purposes, I’ve randomly collected a sample of public figures’ biographies from Wikipedia. These figures include a diverse array of politicians, scientists, writers, religious leaders, artists, and more.

Before we start building the pipeline, import the required Python libraries:
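The original gist isn’t embedded here, so the following is a minimal sketch of the imports the steps below rely on; the actual notebook’s modules and versions may differ:

```python
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.mixture import GaussianMixture

from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# word_tokenize needs the punkt models; download once if missing
nltk.download("punkt", quiet=True)
```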

First, convert the text data for each public figure’s name and biography into a pandas data frame:
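A sketch of this step, with hypothetical placeholder data standing in for the scraped Wikipedia biographies:

```python
# Hypothetical sample; the actual notebook uses full Wikipedia biographies
names = ["Abraham Lincoln", "Marie Curie", "Taylor Swift"]
bios = [
    "Abraham Lincoln was an American lawyer and statesman who served as ...",
    "Marie Curie was a physicist and chemist who conducted pioneering ...",
    "Taylor Swift is an American singer-songwriter whose narrative ...",
]

df = pd.DataFrame({"name": names, "bio": bios})
```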

Second, pre-process the text data with tokenization, removal of OCR noise, and stemming, using the following function:
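One plausible implementation of such a function; the exact noise filter and the choice of the Snowball stemmer are assumptions:

```python
stemmer = SnowballStemmer("english")

def preprocess(text):
    """Tokenize, drop noisy non-alphabetic tokens, and stem each word."""
    tokens = word_tokenize(text.lower())
    # Keeping only purely alphabetic tokens filters out OCR noise,
    # punctuation, and stray numbers
    tokens = [t for t in tokens if re.fullmatch(r"[a-z]+", t)]
    return [stemmer.stem(t) for t in tokens]
```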

Third, create a matrix of TF-IDF vectors, passing the pre-processing function from the second step as the tokenizer parameter. TF-IDF is a widely used and simple NLP method that scores a word by how often it appears in a document, down-weighted by how often it appears across all documents. In other words, TF-IDF surfaces the words that are unique and significant to the document in question, despite noise or frequently occurring words.
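A sketch of this step; the max_df and min_df cut-offs are illustrative, not the notebook’s actual settings:

```python
tfidf = TfidfVectorizer(
    tokenizer=preprocess,  # custom pre-processing from the previous step
    max_df=0.8,            # drop stems appearing in more than 80% of bios
    min_df=2,              # drop stems appearing in fewer than 2 bios
)
tfidf_matrix = tfidf.fit_transform(df["bio"])
```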

Finally, calculate the similarity distances between people using pairwise cosine similarity with the following function, which takes the matrix of TF-IDF vectors from the last step as input:
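A minimal version of such a function, converting cosine similarity into a distance; the function name is mine:

```python
def similarity_distances(matrix):
    """Pairwise cosine distance: 1 minus pairwise cosine similarity."""
    return 1 - cosine_similarity(matrix)

dist = similarity_distances(tfidf_matrix)  # square (n_people x n_people) matrix
```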

GMM Clustering and Review

Now that we have our starter NLP pipeline built, let’s cluster the people via GMM clustering, using the matrix of TF-IDF vectors from the previous step:
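GaussianMixture in scikit-learn expects dense input, and high-dimensional TF-IDF vectors can make covariance estimation unstable, so this sketch first reduces dimensionality with TruncatedSVD. Whether the original notebook did the same is an assumption, as are the component counts:

```python
from sklearn.decomposition import TruncatedSVD

# Reduce the sparse, high-dimensional TF-IDF matrix to something GMM can handle
svd = TruncatedSVD(n_components=50, random_state=42)
reduced = svd.fit_transform(tfidf_matrix)

gmm = GaussianMixture(n_components=3, random_state=42)
df["cluster"] = gmm.fit_predict(reduced)

# Soft assignments: each row gives one person's membership probabilities
membership_probs = gmm.predict_proba(reduced)
```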

Now, let’s visualize how many people exist in each cluster:
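A simple bar chart of cluster sizes, sketched with matplotlib:

```python
cluster_sizes = df["cluster"].value_counts().sort_index()

cluster_sizes.plot(kind="bar", rot=0)
plt.xlabel("Cluster")
plt.ylabel("Number of people")
plt.title("People per Cluster")
plt.show()
```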

Based on the chart above, most people belong to cluster 2. Let’s look at a few people in each cluster to get a sense of who they are.
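One quick way to peek at each cluster’s members; the column names follow the hypothetical DataFrame sketched earlier:

```python
# Print up to five names per cluster
for label in sorted(df["cluster"].unique()):
    members = df.loc[df["cluster"] == label, "name"].head(5).tolist()
    print(f"Cluster {label}: {members}")
```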

Cluster 2 mostly has political leaders:

Cluster 0 mostly has the creative types — writers, inventors, and scientists!

Cluster 1 mostly has musicians:

Hierarchical Clustering via Dendrogram

Now that we’ve assigned these public figures into different clusters based on their similarities, let’s go a step further by performing hierarchical clustering on them based on the cosine similarity distances we calculated earlier in this tutorial. In a nutshell, hierarchical clustering merges similar clusters together and creates a hierarchy of clusters based on a defined similarity metric. Using hierarchical clustering, we could quickly detect relationships between clusters and sub-groups within each cluster.

The following code performs hierarchical clustering and outputs a dendrogram visualizing the hierarchy of clusters and the relationships between them and their sub-groups:
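A sketch of this step using SciPy, fed by the cosine-distance matrix computed earlier. Ward linkage is one common choice for text clustering, though the notebook’s actual linkage method is an assumption:

```python
# linkage() wants a condensed distance vector, not a square matrix
condensed = squareform(dist, checks=False)
Z = linkage(condensed, method="ward")

fig, ax = plt.subplots(figsize=(10, 25))
dendrogram(Z, labels=df["name"].tolist(), orientation="left", ax=ax)
plt.tight_layout()
plt.show()
```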

Hierarchy of Clusters

Zooming into some of the clusters and their sub-groups in the dendrogram, we can clearly see their relationships and quickly assess the most similar people for a given public figure.

In the following cluster of musicians, there are two sub-groups: one containing rock and pop bands from earlier periods (Nirvana, Guns N’ Roses, Aerosmith, etc.) and a second containing pop stars from a more recent period (Justin Bieber, Taylor Swift, etc.). Justin Bieber and Taylor Swift are similar to each other and share the same branch, just as Nirvana and Guns N’ Roses do, probably because each pair appeals to its own distinct demographic group.

Dendrogram of the Musician Cluster

Furthermore, we can also see two sub-groups emerging in the cluster of religious leaders: one consisting mainly of American televangelists from a more recent period and another containing mainly European preachers from a much earlier period. Jerry Falwell and Pat Robertson are on the same branch, probably due to their similar conservative ideologies within the American Evangelical movement, while John Calvin and Huldrych Zwingli also share a branch, probably due to their similar reformist ideas during Europe’s Protestant Reformation.

Dendrogram of the Preacher Cluster

Conclusion

In this tutorial, we have explored several clustering techniques and their benefits. Going forward, we could make further use of the clustered data by constructing experimental designs and making predictions based on patient segments. Depending on the use case, stakeholders could cautiously customize services for each patient segment and then determine which groups of patients benefit most from the changes in the experimental designs. Since each patient belongs to a segment or cluster that best describes him or her, we could treat the patient segment as an engineered feature during predictive modeling. Most importantly, we could also use the additional insights gained from these experiments as features to better capture and predict patients’ changing and unique behaviors and needs.

The Jupyter notebook used in this tutorial is available here:

