Think twice before dropping that first one-hot encoded column

Many machine learning models demand that categorical features are converted to a format they can comprehend via a widely used feature engineering technique called one-hot encoding. Machines aren't that smart.

A common convention after one-hot encoding is to remove one of the one-hot encoded columns from each categorical feature. For example, the feature sex containing values of male and female are transformed into the columns sex_male and sex_female, each containing binary values. Because using either of these columns provides sufficient information to determine a person's sex, we can drop one of them.

In this post, we dive deep into the circumstances where this convention is relevant, necessary, or even prudent.

Read more…

An opinionated guide for gearing up for data science

This post couldn't be any more overdue, but going forward, I'm hoping to be more active and to continue sharing my data science knowledge, particularly the nuances that you acquire on the job.

Whether you're a data scientist, machine learning engineer, or data engineer, your day-to-day typically involves writing code—we are developers after all. Today I'd like share my idiosyncrasies thoughts for setting up a solid local machine for data science, sprinkled with tips and software engineering best practices. What this post doesn't cover are prerequisites for entering the field.

Read more…

A few words about my experience at Insight Data Science

For the past few months, I attended Insight Data Science—a self-directed fellowship (not a bootcamp) designed to help PhDs from all fields transition into a career as a data scientist in industry. I'll say it upfront: Insight was the most challenging and intense professional endeavor I've undertaken (tops even the PhD or building a nonprofit for me!), but also one of the most rewarding. I'd like to take this opportunity to share some of my experiences.

Read more…

The next stage in my data science training

So far my data science training has been entirely self-directed but I'm aware completing the final steps—networking and landing a job—can be exceedingly difficult on your own. Because it's been nearly 7 months since I decided to embark on this journey, I figured this is a good opportunity to share my plan going forward.

Read more…

How to install Keras with a TensorFlow backend for deep learning

Some of the biggest challenges I've faced while teaching myself data science have been determining what tools are available, which one to invest in learning, or how to access them. For example, once I reached the stage in my training where I was ready to add deep learning to my repertoire, I was baffled on how troublesome it was to setup Keras and TensorFlow to work with Jupyter notebooks via the Anaconda distribution. Most solutions glossed over key steps, others just didn't work. After some digging, I came up with my own solution and decided to share it in detail with the community.

Read more…

Using natural language processing to build a spam filter for text messages

After watching the film Arrival, I developed a deep appreciation for the field of linguistics (also my favorite movie of 2016). Human language is the most unstructured type of data, and yet we effortlessly parse and interpret it, and even generate our own. On the other hand, understanding everyday language is a significant challenge for machines; this is the focus of natural language processing (NLP)—the crossroads between linguistics and AI. In this post, we'll make use of some NLP concepts and combine them with machine learning to build a spam filter for SMS text messages.

Read more…

Training a machine to determine whether a mushroom is edible

It's been awhile since my last blog post but we've been busy with a big move from Houston to Brooklyn. The opportunities in New York City for data science and AI seem endless! I've also been spending some time putting to practice my newly acquired knowledge of machine learning by browsing through open datasets.

One dataset that piqued my interest is the mushroom dataset from the UCI Machine Learning Repository describing different species from the genera Agaricus and Lepiota. The data are taken from The Audubon Society Field Guide to North American Mushrooms, which states "there is no simple rule for determining the edibility of a mushroom". Challenged by this bold claim, I wanted to explore if a machine could succeed here. In addition to answering this question, this post explores some common issues in machine learning and how to use Python's go-to machine learning library, Scikit-learn, to address them.

Read more…

Applying k-means clustering to flow cytometry analysis

Is it possible for a machine to group together similar data on its own? Absolutely—this is what clustering algorithms are all about. These algorithms fall under a branch of machine learning called unsupervised learning. In this branch, we give a machine an unlabeled training set containing data regarding the features but not the classes. Algorithms are left to their own devices to discover the underlying structure concealed within the data. This is in stark contrast to supervised learning, where the correct answers are available and utilized to train a predictive model.

In this post, I'd like to introduce an algorithm called $k$-means clustering and also construct one from scratch. Additionally, I'll demonstrate how this algorithm can be used automate an aspect of a widely used life sciences technique called flow cytometry.

Read more…

Iterables, iterators and generators, oh my! Part 2

In a previous post, we learned about iterators—one of the most powerful programming constructs. Our discussion divulged their role as a fundamental but hidden component of Python's for loop, which led to a startling revelation regarding the for loop itself (no spoilers here). We also discovered how to implement the iterator protocol to create our very own iterators, even constructing ones that represent infinite data structures. In this post, I'd like to build upon our knowledge and introduce a more elegant and efficient means for producing iterators. However, if you're not comfortable with the iterator protocol and the inner workings of iterators, I strongly recommend familiarizing yourself with Part 1 first.

Read more…

Building a logistic regression classifier from the ground up

The logistic regression classifier is a widely used machine learning model that predicts the group or category that an observation belongs to. When implementing this model, most people rely on some library or API: just hand over a dataset and out come the predictions. However, I'm not a fan of using black boxes without first understanding what's going on inside. In fact, lifting the hood on this classifier provides a segue to more complex models such as neural networks. Therefore, in this post, I'd like to explore the methodology behind logistic regression classifiers and walk through how to construct one from scratch.

Read more…