Applying k-means clustering to flow cytometry analysis

Is it possible for a machine to group together similar data on its own? Absolutely: this is what clustering algorithms are all about. These algorithms fall under a branch of machine learning called unsupervised learning. In this branch, we give a machine an unlabeled training set, which contains the feature values but not the classes they belong to. The algorithm is left to its own devices to discover the underlying structure concealed within the data. This is in stark contrast to supervised learning, where the correct answers are available and are used to train a predictive model.

In this post, we'll not only learn about an algorithm called $k$-means clustering, but also construct one from scratch. We'll then apply it to automate an aspect of a widely used life sciences technique called flow cytometry.
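
As a rough preview of what building $k$-means from scratch can involve, here is a minimal sketch in plain Python. It is not the implementation from the full post: the names `kmeans` and `squared_distance`, the parameters `init`, `n_iter` and `seed`, and the toy points are all assumptions made for illustration. The sketch alternates the two standard steps, assigning each point to its nearest centroid and then moving each centroid to the mean of its members, until the assignments stop changing.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, init=None, n_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups.

    `init` optionally supplies the starting centroids; otherwise k data
    points are drawn at random (an assumption of this sketch).
    """
    rng = random.Random(seed)
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Toy data: two well-separated groups of 2-D points (made up for this sketch).
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
toy_centroids, toy_labels = kmeans(toy_points, k=2, init=[toy_points[0], toy_points[3]])
```

Note that the result of $k$-means depends on the starting centroids, which is why the sketch lets us pass them in explicitly.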

Read more…

Iterables, iterators and generators, oh my! Part 2

In a previous post, we learned about iterators—one of the most powerful programming constructs. Our discussion divulged their role as a fundamental but hidden component of Python's for loop, which led to a startling revelation regarding the for loop itself (no spoilers here). We also discovered how to implement the iterator protocol to create our very own iterators, even constructing ones that represent infinite data structures. In this post, we'll build upon our knowledge and learn about more elegant and efficient means for producing iterators. However, if you're not comfortable with the iterator protocol and the inner workings of iterators, I strongly recommend familiarizing yourself with Part 1 first.
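
To give a flavor of the contrast, here is a small sketch, assuming the "more elegant means" are generators, as the series title suggests. The names `CountFrom` and `count_from` are made up for this example. Both versions represent the same infinite sequence of integers: the first implements the iterator protocol by hand, the second lets a generator function do that work for us.

```python
from itertools import islice


class CountFrom:
    """Infinite counter written directly against the iterator protocol."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself from __iter__.
        return self

    def __next__(self):
        value = self.current
        self.current += 1
        return value


def count_from(start):
    """The same infinite counter as a generator function."""
    current = start
    while True:
        yield current  # execution pauses here until the next value is requested
        current += 1


# islice takes a finite slice, so we never try to materialize the whole sequence.
first_five_class = list(islice(CountFrom(3), 5))
first_five_gen = list(islice(count_from(3), 5))
```

Both lists hold the same five values; the generator version simply needs far less boilerplate.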

Read more…

Building a logistic regression classifier from the ground up

The logistic regression classifier is a widely used machine learning model that predicts the group or category that an observation belongs to. When implementing this model, most people rely on some package or API: just hand over a dataset, pick a few parameters and out come the predictions. However, I'm not a fan of using black boxes without first understanding what's going on inside. In fact, lifting the hood on this classifier provides a segue to more complex models such as neural networks. Therefore, this post will explore the methodology behind logistic regression classifiers and walk through how to construct one from scratch.
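
For a taste of what sits inside the black box, here is a minimal sketch of a logistic regression classifier trained with batch gradient descent in plain Python. It is only an illustration under my own assumptions, not the code from the full post: the names `sigmoid`, `fit_logistic` and `predict`, the learning rate, the iteration count and the toy data are all hypothetical.

```python
import math


def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Fit weights w and bias b by gradient descent on the average log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this observation: p(y=1 | x) - y.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # Step against the average gradient.
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b, threshold=0.5):
    """Label an observation 1 when its predicted probability reaches the threshold."""
    return [
        int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= threshold)
        for xi in X
    ]


# Toy data: one feature, with class 1 on the positive side (made up for this sketch).
X_toy = [[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]]
y_toy = [0, 0, 0, 1, 1, 1]
w_toy, b_toy = fit_logistic(X_toy, y_toy)
predictions = predict(X_toy, w_toy, b_toy)
```

The model is just a weighted sum passed through a sigmoid, which is also the basic unit of a neural network.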

Read more…

Iterables, iterators and generators, oh my! Part 1

Iterators and generators are among my favorite programming tools, and they're also some of the most powerful. These constructs enable us to write cleaner, more flexible and higher-performance code, which makes them an invaluable addition to any programmer's toolbox. They are also an elegant means of working with large and potentially infinite data structures, which comes in handy for data science. However, they can be some of the more perplexing concepts to grasp at first.

This article aims to deliver a gentle but in-depth introduction to iterators and generators in Python, although they're prevalent in other languages too. To appreciate generators, though, we first need a good handle on iterators. And to understand iterators, we need to start with iterables.
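
As a quick illustration of the distinction, the sketch below uses only Python's built-in `iter()` and `next()`; the variable names are arbitrary. A list is an iterable: it can hand out iterators, but it is not one itself. The iterator is the object that actually steps through the values and signals exhaustion with `StopIteration`.

```python
numbers = [10, 20, 30]        # an iterable: it can produce iterators

iterator = iter(numbers)      # an iterator: it remembers its position
first = next(iterator)        # consumes 10
second = next(iterator)       # consumes 20
remaining = list(iterator)    # whatever is left over

# Once exhausted, an iterator stays exhausted and raises StopIteration.
try:
    next(iterator)
    exhausted = False
except StopIteration:
    exhausted = True

# The iterable itself is untouched, so a fresh iterator starts from the beginning.
restarted = list(iter(numbers))

# A list has no __next__ method, which is why it is not an iterator.
list_is_iterator = hasattr(numbers, "__next__")
```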

Read more…

Exploring the Pokemon dataset with pandas and seaborn

The Pokemon dataset is a complete listing of all Pokemon species as of mid-2016, containing data about their type and statistics. Considering how diverse Pokemon are, it's worthwhile analyzing this dataset to identify insights into the design of new Pokemon and how the game is balanced, and potentially to help players select the best Pokemon, if one exists. And having been a fervent Pokemon fan as a kid, I'm also dreadfully curious!

Read more…

What exactly is data science?

I figured I'd focus my first post on a broad topic and what better way than to discuss what this blog will revolve around: data science! Actually, when I talk to people about data science, I usually get blank stares, maybe a head nod or two, and if I'm lucky, someone will ask "is that related to big data?". This is understandable because data science is an emerging field—practically everyone has their own definition, so I'd like to begin by sharing mine.

Read more…