Exploring the Pokemon dataset with pandas and seaborn
The Pokemon dataset is a listing of all Pokemon species as of mid-2016, containing data about their type and statistics. Considering how diverse Pokemon are, I was interested in analyzing this dataset to learn how the game is balanced and to potentially identify the best Pokemon, if one exists. Plus, it's a good excuse for me to practice exploratory data analysis with Python's open-source libraries: pandas for data analysis and seaborn for visualizations.
While I could easily analyze the dataset using Excel, in my opinion it isn't very handy for cleaning messy datasets or telling a story. In addition, Excel is proprietary and closed, can't process larger datasets (a worksheet is limited to about a million rows), and its analyses lack repeatability and transparency. For these reasons, I've chosen to perform my analyses using Jupyter notebooks.
The Pokemon dataset was made available on Kaggle. Let's begin by loading it and taking a peek inside.
import pandas as pd
import seaborn as sns

df = pd.read_csv('data/pokemon.csv')
df.head()
Here's a quick description of each column, taken from Kaggle:
- #: Pokedex entry number of the Pokemon
- Name: name of the Pokemon
- Type 1: each Pokemon has a type, this determines weakness/resistance to attacks [referred to as the primary type]
- Type 2: some Pokemon are dual type and have 2 [referred to as the secondary type]
- Total: sum of all stats that come after this, a general guide to how strong a Pokemon is
- HP: hit points, or health, defines how much damage a Pokemon can withstand before fainting
- Attack: the base modifier for normal attacks
- Defense: the base damage resistance against normal attacks
- Sp. Atk: special attack, the base modifier for special attacks
- Sp. Def: the base damage resistance against special attacks
- Speed: determines which Pokemon attacks first each round
- Generation: refers to which grouping/game series the Pokemon was released in
- Legendary: a boolean that identifies whether the Pokemon is legendary
Table of contents
- Cleaning the dataset
- How are Pokemon numbers distributed across generations?
- What are the most common types of Pokemon?
- What are the strongest and weakest Pokemon species?
- What are the strongest and weakest types of Pokemon?
- Do any types of Pokemon excel at certain statistics over others?
- Are any of the statistics correlated?
- Considerations regarding the results
1. Cleaning the dataset
One of our first priorities is ensuring the dataset is tidy, meaning each column should represent a single variable and each row should represent a single data point (a Pokemon species in this case). According to Bulbapedia, there are only 721 Pokemon species as of generation 6, so we should check whether there are any anomalies or duplicated rows in our dataset. In addition, our initial inspection revealed that some Pokemon have missing values for Type 2, which confirms that not all Pokemon have a secondary type; we have to account for these missing values. We also notice the dataset has two observational units, the Pokemon's identity and its statistics, which should be normalized into two tables to prevent future inconsistencies.
But before any rigorous cleaning, let's rename the # column to id, and convert all column labels to lower case.
df.rename(columns={'#': 'id'}, inplace=True)
df.columns = df.columns.str.lower()
Now let's take a look at any duplicated rows via the id column.
df[df.duplicated('id', keep=False)].head()
We can clearly see that some Pokemon have multiple "forms" but share the same id value. Since these additional forms represent the same Pokemon species, albeit with different statistics, let's keep only the first row for each id and exclude the other forms from our analysis.
df.drop_duplicates('id', keep='first', inplace=True)
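To make the difference between the two keep options concrete, here's a minimal sketch on a hand-typed toy table (the rows and the "(Mega)" label are illustrative assumptions, not read from the Kaggle file). Note that keep='first' only retains the base form if it happens to be listed before the alternate forms.

```python
import pandas as pd

# Toy table (illustrative): id 3 appears twice, as a base form and an alternate form.
toy = pd.DataFrame({
    'id': [1, 2, 3, 3],
    'name': ['Bulbasaur', 'Ivysaur', 'Venusaur', 'Venusaur (Mega)'],
    'total': [318, 405, 525, 625],
})

# keep=False flags every row whose id occurs more than once.
all_forms = toy[toy.duplicated('id', keep=False)]

# keep='first' retains only the first row seen for each id.
species = toy.drop_duplicates('id', keep='first')

print(all_forms['name'].tolist())
print(species['name'].tolist())
```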
Since all Pokemon species have a primary type but not necessarily a secondary type, we'll fill in these missing values with a placeholder.
# Assign the result back; an in-place fillna on a selected column may not update df in recent pandas
df['type 2'] = df['type 2'].fillna('None')
The dataset contains information regarding both the identity and the statistics of each Pokemon species; therefore, let's separate these two observational units into separate tables: pokedex and statistics.
pokedex = df[['id', 'name', 'type 1', 'type 2', 'generation', 'legendary']]
statistics = pd.merge(
    df,
    pokedex,
    on='id'
).loc[:, ['id', 'hp', 'attack', 'defense', 'sp. atk', 'sp. def', 'speed',
          'total']]
Now let's review the two tables along with our newly made changes.
pokedex.head()
statistics.head()
Everything looks good and the dataset is tidy! It's now time to get to those questions.
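Since both tables are meant to be keyed by a unique id, one way to guard that assumption is to let pandas verify it at merge time. The sketch below uses two small made-up tables (pokedex_toy and stats_toy are hypothetical names) and the validate argument of pd.merge.

```python
import pandas as pd

# Made-up miniature versions of the two tables.
pokedex_toy = pd.DataFrame({'id': [1, 2, 3], 'name': ['A', 'B', 'C']})
stats_toy = pd.DataFrame({'id': [1, 2, 3], 'hp': [45, 60, 80]})

# validate='one_to_one' raises pandas.errors.MergeError if id repeats in either table.
joined_toy = pd.merge(pokedex_toy, stats_toy, on='id', validate='one_to_one')

# A statistics table that still contains a duplicated id fails the same check.
stats_dup = pd.concat([stats_toy, stats_toy.iloc[[2]]], ignore_index=True)
try:
    pd.merge(pokedex_toy, stats_dup, on='id', validate='one_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(joined_toy)
print(duplicate_detected)
```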
2. How are Pokemon numbers distributed across generations?
We'll start by taking a look at the total number of Pokemon in each generation.
# factorplot was renamed to catplot in seaborn 0.9
sns.catplot(
    x='generation',
    data=pokedex,
    kind='count'
).set_axis_labels('Generation', '# of Pokemon');
There's no steady increase or decrease across generations; however, even-numbered generations introduced fewer Pokemon than odd-numbered generations.
Let's dig a bit deeper and examine the distribution of primary types of Pokemon across generations.
# factorplot was renamed to catplot in seaborn 0.9
sns.catplot(
    x='generation',
    data=pokedex,
    col='type 1',
    kind='count',
    col_wrap=3
).set_axis_labels('Generation', '# of Pokemon');
Most types of Pokemon have similar counts across generations, but a few interesting tidbits stand out:
- Generation 1 included quite a few Poison-type Pokemon but they've been more or less an afterthought in later generations.
- Generation 5 received a surge of new Psychic-type and Dark-type Pokemon, while Steel-type Pokemon received a large boost in generation 3.
- Normal-type Pokemon had a strong presence in every generation except generation 6.
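The grid of count plots can be hard to read for exact numbers. As a possible companion, a cross-tabulation gives the same counts as a table; this is a sketch on made-up rows, not the real counts.

```python
import pandas as pd

# Made-up rows standing in for the pokedex table.
toy = pd.DataFrame({
    'generation': [1, 1, 1, 2, 2, 3],
    'type 1': ['Poison', 'Poison', 'Normal', 'Normal', 'Dark', 'Steel'],
})

# Rows are primary types, columns are generations, cells are species counts.
table = pd.crosstab(toy['type 1'], toy['generation'])
print(table)
```

Running the same call on pokedex['type 1'] and pokedex['generation'] should produce one row per primary type and one column per generation.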
3. What are the most common types of Pokemon?
It's been a long time since I've played the Pokemon games, so let's begin by investigating whether any types appear only as a primary type or only as a secondary type, or whether both columns simply share the same set.
import numpy as np

# Types that appear in one column but never in the other
unique_type1 = np.setdiff1d(pokedex['type 1'], pokedex['type 2'])
unique_type2 = np.setdiff1d(pokedex['type 2'], pokedex['type 1'])
print('Unique Type 1: ', end='')
if unique_type1.size == 0:
    print('No unique types')
else:
    for u in unique_type1:
        print(u)
print('Unique Type 2: ', end='')
if unique_type2.size == 0:
    print('No unique types')
else:
    for u in unique_type2:
        print(u)
No type is exclusive to the primary column. The only value exclusive to the secondary column is None, the placeholder we added earlier for Pokemon species without a secondary type. Now let's count the total number of primary and secondary types.
type1, type2 = pokedex.groupby('type 1'), pokedex.groupby('type 2')
print('Type 1 count: {}'.format(len(type1)))
print('Type 2 count: {}'.format(len(type2)))
There are a total of 18 primary types and 19 secondary types (again, the additional secondary type refers to Pokemon species without a secondary type).
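As an aside, the same counts can be obtained without building groupby objects. The sketch below uses a made-up four-row table to show nunique, including a variant that leaves out the 'None' placeholder.

```python
import pandas as pd

# Made-up rows; 'None' is the placeholder for a missing secondary type.
toy = pd.DataFrame({
    'type 1': ['Grass', 'Fire', 'Water', 'Bug'],
    'type 2': ['Poison', 'None', 'None', 'Flying'],
})

n_type1 = toy['type 1'].nunique()
n_type2 = toy['type 2'].nunique()
# Exclude the placeholder to count real secondary types only.
n_type2_real = toy.loc[toy['type 2'] != 'None', 'type 2'].nunique()

print(n_type1, n_type2, n_type2_real)
```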
To determine the most common primary and secondary type, let's examine the distributions for each.
# factorplot was renamed to catplot in seaborn 0.9
sns.catplot(
    y='type 1',
    data=pokedex,
    kind='count',
    order=pokedex['type 1'].value_counts().index,
    aspect=1.5,
    color='green'
).set_axis_labels('# of Pokemon', 'Type 1')
sns.catplot(
    y='type 2',
    data=pokedex,
    kind='count',
    order=pokedex['type 2'].value_counts().index,
    aspect=1.5,
    color='purple'
).set_axis_labels('# of Pokemon', 'Type 2');
We can draw a few conclusions from these plots:
- Nearly half of all Pokemon species only have a primary type.
- Normal-type is a very common primary type and if I remember correctly, this type was considered "generic" back in generation 1 and had mediocre statistics. However, it's surprising that Normal-type Pokemon are outnumbered by Water-type Pokemon; perhaps nowadays Water-type Pokemon are also considered "generic"?
- Nearly 1 in 7 Pokemon species have a secondary Flying-type.
- Pokemon species with Flying-type as their primary type are very rare.
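Proportions like "nearly half" or "1 in 7" can also be read off numerically rather than estimated from bar lengths. Here's a small sketch on made-up data; the shares it prints are for the toy rows only.

```python
import pandas as pd

# Made-up secondary types for eight species.
toy = pd.DataFrame({
    'type 2': ['None', 'None', 'None', 'None',
               'Flying', 'Flying', 'Poison', 'Ground'],
})

# The mean of a boolean mask is the fraction of True values.
single_share = (toy['type 2'] == 'None').mean()

# normalize=True turns counts into fractions of all rows.
type2_share = toy['type 2'].value_counts(normalize=True)

print(single_share)
print(type2_share['Flying'])
```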
These findings are curious but it'd be more exciting to investigate the distribution of the various combinations of primary and secondary types of Pokemon.
dual_types = pokedex[pokedex['type 2'] != 'None']
sns.heatmap(
    dual_types.groupby(['type 1', 'type 2']).size().unstack(),
    linewidths=1,
    annot=True
);
This plot reveals that the five most common combinations of primary and secondary type are, in order:
- Normal/Flying-type
- Bug/Flying-type
- Bug/Poison-type
- Grass/Poison-type
- Water/Ground-type
It would be worth determining whether the most common combinations are relatively weaker in order to keep the game balanced (we'll examine this later). I find it paradoxical that there actually exist Pokemon species that are Fire/Water-type and Ground/Flying-type!
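Reading a ranking off an annotated heatmap is error-prone, so one option is to sort the same counts directly. This is a sketch on made-up rows that mirrors the groupby used for the heatmap.

```python
import pandas as pd

# Made-up rows; 'None' marks species without a secondary type.
toy = pd.DataFrame({
    'type 1': ['Normal', 'Normal', 'Normal', 'Bug', 'Bug', 'Grass', 'Fire'],
    'type 2': ['Flying', 'Flying', 'Flying', 'Poison', 'Poison', 'Poison', 'None'],
})

dual = toy[toy['type 2'] != 'None']

# Count each (primary, secondary) pair and sort from most to least common.
combo_counts = (
    dual.groupby(['type 1', 'type 2'])
    .size()
    .sort_values(ascending=False)
)
top = combo_counts.head(2)
print(top)
```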
Just for kicks, let's take a look at the distribution of types for those Pokemon species that lack a secondary type.
single_types = pokedex[pokedex['type 2'] == 'None']
# factorplot was renamed to catplot in seaborn 0.9
sns.catplot(
    y='type 1',
    data=single_types,
    kind='count',
    order=single_types['type 1'].value_counts().index,
    aspect=1.5,
    color='grey'
).set_axis_labels('# of Pokemon', 'Type 1');
Nothing too surprising here besides the considerably large proportion of Pokemon that are only Normal-type or Water-type!
4. What are the strongest and weakest Pokemon species?
One potential approach to this question is to rank Pokemon according to the sum of their six statistics. Fortunately, this is already available to us as total in the statistics table. Let's use this metric to identify the ten strongest Pokemon.
pd.merge(
    pokedex,
    statistics,
    on='id'
).sort_values('total', ascending=False).head(10)
Immediately, we can see that using this metric introduces a major hurdle: at least nine Pokemon are tied for second place with a total value of 680, and it wouldn't be surprising if there were other ties. Therefore, summing the raw statistics is not a meaningful way to answer our question.
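To see how widespread such ties are, one could flag every repeated total and look at the ranks pandas assigns. The sketch below uses four made-up species.

```python
import pandas as pd

# Made-up totals; B and C are tied.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'total': [720, 680, 680, 600],
})

# keep=False marks every row whose total also occurs elsewhere.
tied = toy[toy['total'].duplicated(keep=False)]

# method='min' gives tied rows the same (best) rank and skips the next one.
ranks = toy['total'].rank(ascending=False, method='min')

print(tied['name'].tolist())
print(ranks.tolist())
```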
Let's try a different metric: standardize the six statistic columns independently by converting each value into a z-score so that when we do take the sum, we account for the variation in each statistic using its mean and standard deviation across all Pokemon species. As a reminder, a z-score is defined by
$$z = \frac{x-\mu}{\sigma}$$
There's no point in keeping the total column anymore, so let's ignore it when we make a new standardized table. In addition, we'll need to temporarily move the id column to the row labels so we don't standardize its values.
std_stats = statistics.drop('total', axis='columns').set_index('id').apply(
    lambda x: (x - x.mean()) / x.std())
We'll define a new column, strength, as the sum of the z-scores of each statistic: the higher this value, the stronger the Pokemon.
std_stats['strength'] = std_stats.sum(axis='columns')
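As a quick sanity check of the standardization, here's a worked sketch on three made-up species with only two statistics. It relies on the fact that pandas' std() is the sample standard deviation (ddof=1) by default, which is also what the apply call above uses.

```python
import pandas as pd

# Made-up base statistics for three species, indexed by id.
toy = pd.DataFrame(
    {'hp': [40, 60, 80], 'speed': [30, 60, 120]},
    index=pd.Index([1, 2, 3], name='id'),
)

# Standardize each column with its own mean and sample standard deviation.
z = toy.apply(lambda col: (col - col.mean()) / col.std())

# hp has mean 60 and std 20, so its z-scores are -1, 0 and 1.
print(z['hp'].tolist())

# Strength is the row-wise sum of the z-scores.
z['strength'] = z[['hp', 'speed']].sum(axis='columns')
print(z['strength'].idxmax())
```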
We can now add the id column back to the table and determine the ten strongest Pokemon based on their strength value.
std_stats.reset_index(inplace=True)
pd.merge(
    pokedex,
    std_stats,
    on='id'
).sort_values('strength', ascending=False).head(10)
That's more like it! We now have a definitive ranking of the 10 strongest Pokemon species in order: Arceus, Giratina, Lugia, Ho-oh, Xerneas, Yveltal, Mewtwo, Reshiram, Palkia and Rayquaza. Not surprisingly, they're all legendary Pokemon! And even less shocking, Arceus is colloquially known as the "god" Pokemon.
Now let's take a look at the ten weakest Pokemon.
pd.merge(
    pokedex,
    std_stats,
    on='id'
).sort_values('strength').head(10)
The ten weakest Pokemon are in order: Sunkern, Azurill, Kricketot, Wurmple, Weedle, Caterpie, Ralts, Scatterbug, Magikarp and Feebas. Interestingly, many of these Pokemon are Bug-type.
It would also be worth identifying the strongest non-legendary Pokemon since these are easier to catch in-game as compared to legendary ones.
pd.merge(
    pokedex[~pokedex['legendary']],
    std_stats,
    on='id'
).sort_values('strength', ascending=False).head(10)
The ten strongest non-legendary Pokemon are in order: Slaking, Goodra, Garchomp, Hydreigon, Salamence, Dragonite, Tyranitar, Metagross, Archeops and Blissey. Slaking commands a strong lead. Regardless, many of these are Dragon-type Pokemon. Very intriguing!
5. What are the strongest and weakest types of Pokemon?
Since our last analysis hinted that certain types of Pokemon may be stronger than others, let's take a look at the strongest combinations of primary and secondary types. In addition, instead of using the mean as a measure of central tendency and assuming the strengths of each type are normally distributed, let's use the median.
joined = pd.merge(
    pokedex,
    std_stats,
    on='id'
)
# Select strength before aggregating so text columns such as name never reach median()
medians = joined.groupby(['type 1', 'type 2'])['strength'].median()
sns.heatmap(
    medians.unstack(),
    linewidths=1,
    cmap='RdYlBu_r'
);
The heatmap is teeming with information but it's difficult to rank the strongest combinations of primary and secondary types. Let's identify the top five directly.
medians.reset_index().sort_values('strength', ascending=False).head()
We can draw a few conclusions from these data:
- The five strongest combinations of primary and secondary types are in order: Ghost/Dragon, Dragon/Fire, Steel/Dragon, Dragon/Electric and Dragon/Ice.
- Among the strongest Pokemon are some primary Psychic-types, secondary Fighting-types and secondary Flying-types.
- The strongest Pokemon tend to have Dragon-type as either their primary or secondary type, and among the weakest Pokemon are primary Bug-types. Our suspicions are confirmed!
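The choice of the median over the mean earlier in this section matters most when a group contains one extreme member. Here's a sketch with made-up strength values showing how a single outlier shifts the mean but not the median.

```python
import pandas as pd

# Made-up strength values; the Dragon group contains one extreme member.
toy = pd.DataFrame({
    'type 1': ['Dragon', 'Dragon', 'Dragon', 'Bug', 'Bug', 'Bug'],
    'strength': [1.0, 1.5, 8.0, -2.0, -1.5, -1.0],
})

summary = toy.groupby('type 1')['strength'].agg(['mean', 'median'])
print(summary)
```

In this toy table the Dragon mean is pulled up to 3.5 by the single value of 8.0, while the median stays at 1.5; the symmetric Bug group has the same mean and median.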
Since legendary Pokemon species are a) rare and b) typically vastly stronger than non-legendary Pokemon, the former group may be confounding our results. Therefore, let's perform the same analysis as above without legendary Pokemon.
joined_nolegs = pd.merge(
    pokedex[~pokedex['legendary']],
    std_stats,
    on='id'
)
# Select strength before aggregating so text columns such as name never reach median()
medians = joined_nolegs.groupby(['type 1', 'type 2'])['strength'].median()
sns.heatmap(
    medians.unstack(),
    linewidths=1,
    cmap='RdYlBu_r'
);
Let's also list the five strongest combinations.
medians.reset_index().sort_values('strength', ascending=False).head()
The results change a bit when excluding legendary Pokemon: Dragon-type is not as dominant; in fact, there's more diversity in strength among the different types. This also indicates that many legendary Pokemon species are Dragon-type, and the game maintains balance by ensuring these Pokemon are a lot less common. Unfortunately for Bug-type Pokemon, even though they're the fourth most common primary type, they're still among the weakest of all.
Our earlier analysis pointed out the most common combinations of primary and secondary types: Normal/Flying, Bug/Flying, Bug/Poison, Grass/Poison and Water/Ground. If we look at the strength of these combinations, we see that none of them are exceptionally strong or weak, so the game is indeed balanced in this regard!
6. Do any types of Pokemon excel at certain statistics over others?
We've seen that a Pokemon's type can influence how strong it is. Now let's investigate if any specific statistic is driving these results and whether different types of Pokemon excel at a particular statistic. Since all Pokemon have a primary type, let's restrict our analysis to primary types only.
sns.heatmap(
    # numeric_only=True skips text columns such as name, which have no median
    joined.groupby('type 1').median(numeric_only=True).loc[:, 'hp':'speed'],
    linewidths=1,
    cmap='RdYlBu_r'
);
We can draw a few conclusions from these data:
- Flying-type Pokemon are really fast, which makes intuitive sense.
- Among the Pokemon with the highest defense are Rock- or Steel-type Pokemon. There is method to the madness after all!
Again, since including legendary Pokemon species may be concealing some interesting tidbits, let's repeat the same analysis but exclude legendary Pokemon.
sns.heatmap(
    # numeric_only=True skips text columns such as name, which have no median
    joined_nolegs.groupby('type 1').median(numeric_only=True).loc[:, 'hp':'speed'],
    linewidths=1,
    cmap='RdYlBu_r'
);
It looks like we can gain more insight into this data by leaving out legendary Pokemon:
- The fastest Pokemon are Flying-type or Electric-type, while Fairy-type, Rock-type or Steel-type are slowpokes.
- Fighting-type Pokemon have unparalleled attack power but the worst special attack power.
- Psychic-type, Flying-type and Fairy-type Pokemon have abysmal attack power. The former two at least make it up by excelling elsewhere.
- No type of Pokemon has standout HP or special defense.
- Water-type Pokemon have average statistics across the board, which confirms our earlier hunch as they're the most common type of Pokemon. Another good example of the game maintaining balance!
- Rock-type and Steel-type Pokemon still have the absolute best defense but lack in speed. Perhaps there is a correlation between these two statistics?
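Statements such as "the fastest" or "the best defense" can be checked against the table of medians rather than the color scale. The following is a sketch on made-up z-scores for three types and two statistics.

```python
import pandas as pd

# Made-up standardized statistics for six species across three types.
toy = pd.DataFrame({
    'type 1': ['Rock', 'Rock', 'Flying', 'Flying', 'Water', 'Water'],
    'defense': [1.0, 1.4, -0.2, 0.0, 0.1, -0.1],
    'speed': [-0.8, -0.6, 1.1, 0.9, 0.0, 0.2],
})

type_medians = toy.groupby('type 1')[['defense', 'speed']].median()

# idxmax/idxmin return, for each statistic, the type with the highest/lowest median.
best = type_medians.idxmax()
worst = type_medians.idxmin()

print(best.to_dict())
print(worst.to_dict())
```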
7. Are any of the statistics correlated?
Our findings have hinted that certain types of Pokemon that excel at a particular statistic don't fare so well in other statistics. To determine whether the statistics are correlated, let's define a function that computes Pearson's correlation coefficient for each pair of statistics as a measure of the strength of their linear relationship.
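import matplotlib.pyplot as plt
from scipy import stats

def show_corr(x, y, **kwargs):
    # Annotate the current axes with Pearson's r for the pair (x, y)
    r, _ = stats.pearsonr(x, y)
    ax = plt.gca()
    ax.annotate(
        'r = {:.2f}'.format(r),
        xy=(0.45, 0.85),
        xycoords=ax.transAxes
    )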
Since we've already seen examples where the inclusion of legendary Pokemon in the analysis potentially confounds results, let's take a look at the pairwise comparisons of the statistics without legendary Pokemon.
sns.pairplot(
    data=joined_nolegs.loc[:, 'hp':'speed'],
    kind='reg'
).map_offdiag(show_corr);
This plot shows that the five most correlated pairs of statistics are in order:
- special attack/special defense
- defense/special defense
- attack/defense
- attack/HP
- special attack/speed
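An ordering like the one above can also be produced without reading annotations off each panel. Here's a sketch that computes Pearson's r for every pair of columns in a made-up table and sorts the pairs; the numbers are illustrative only.

```python
from itertools import combinations

import pandas as pd

# Made-up statistics for five species.
toy = pd.DataFrame({
    'attack': [1, 2, 3, 4, 5],
    'defense': [1, 3, 2, 5, 4],
    'speed': [5, 4, 3, 2, 1],
})

# Series.corr uses Pearson's r by default; compute it once per unordered pair.
pairs = pd.Series({
    (a, b): toy[a].corr(toy[b])
    for a, b in combinations(toy.columns, 2)
}).sort_values(ascending=False)

print(pairs)
```

Sorting by the absolute value instead (for example with pairs.abs()) would rank pairs by strength of association regardless of sign.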
We can see that none of the statistics are strongly correlated, which is actually important, as strong correlations would favor certain Pokemon or types of Pokemon over others. However, it is worth noting that every pair of statistics is positively correlated except for defense and speed, a case epitomized by Rock-type and Steel-type Pokemon!
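The above plot also reveals a few outliers with remarkably high values for certain statistics. Since these columns hold z-scores, the thresholds below are expressed in standard deviations above the mean. First, let's identify the HP outliers.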
joined_nolegs.loc[joined_nolegs['hp'] > 6, 'name']
Next, the outliers with high defense.
joined_nolegs.loc[joined_nolegs['defense'] > 5, 'name']
And finally, the outliers with high special defense.
joined_nolegs.loc[joined_nolegs['sp. def'] > 5, 'name']
I remember Chansey having extraordinarily high HP from the original Pokemon games so it's not surprising that Blissey, which is Chansey's evolved form, is up there as well. On a different note, it looks like Shuckle is the most defensive Pokemon by a landslide!
8. Considerations regarding the results
Our analysis revealed some wonderful insights about the Pokemon dataset, but it has drawbacks and pitfalls that we must take into account.
Efficacy of the strength metric
To determine the strongest Pokemon, we summed up the standardized statistics for each species. While this is better than total, we still made a critical assumption: each statistic was weighted equally. In the real world, some gamers like to focus on defense to win battles, while others will take advantage of speed or use brute force; it's simply a matter of taste. While our metric approached Pokemon strength as objectively as possible, in the end, it may not be meaningful to all players.
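To illustrate how much the equal-weighting assumption matters, here's a sketch of a weighted variant of the strength metric. The two species, their z-scores and both weight vectors are hypothetical; nothing here comes from the dataset.

```python
import pandas as pd

# Hypothetical z-scores for two species.
z = pd.DataFrame(
    {'attack': [1.0, -0.5], 'defense': [-1.0, 1.5], 'speed': [0.5, -1.0]},
    index=['Fast attacker', 'Sturdy wall'],
)

# Hypothetical weights: equal weighting versus a defense-minded player.
equal = pd.Series({'attack': 1.0, 'defense': 1.0, 'speed': 1.0})
defensive = pd.Series({'attack': 0.5, 'defense': 2.0, 'speed': 0.5})

# DataFrame.dot aligns the weights with the columns and returns one score per row.
equal_strength = z.dot(equal)
defensive_strength = z.dot(defensive)

print(equal_strength.idxmax())
print(defensive_strength.idxmax())
```

With equal weights the fast attacker comes out ahead (0.5 versus 0.0), while the defensive weights reverse the order (-1.25 versus 2.25), which is the sense in which the ranking is a matter of taste.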
In addition, the Pokemon games add another layer of complexity for determining success in battles: the Pokemon's move list. These moves also have their own types (e.g., Fire Blast is a Fire-type move). Each type of move can be more, less or not effective at all depending on the type of Pokemon it has targeted. Moreover, whether a move is successfully executed by a Pokemon has a luck component. Therefore, the player's selection of moves during the battle is ultimately a bigger factor in determining success than the strength of the Pokemon chosen.
Finally, the statistics found in the dataset refer to base values. As a Pokemon levels up and wins battles, the statistics grow. Unfortunately, these statistics grow at different rates. To make things even more convoluted, certain Pokemon naturally excel at specific statistics.
Secret of the heatmaps
The heatmaps displaying the strongest and weakest types of Pokemon were incredibly informative. However, they conceal the number of Pokemon representing a given combination of primary and secondary type. This is important in scenarios where there exist only one or two Pokemon for a particular combination. For example, let's take a look at the unequivocally strongest type of Pokemon: Ghost/Dragon.
pokedex[(pokedex['type 1'] == 'Ghost') & (pokedex['type 2'] == 'Dragon')]
It looks like Giratina is the one and only Ghost/Dragon-type Pokemon, and it's also a rare legendary Pokemon! On the other hand, let's take a look at the strongest type of non-legendary Pokemon: Dragon/Flying.
pokedex[(pokedex['type 1'] == 'Dragon') & (pokedex['type 2'] == 'Flying')]
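Looking up one combination at a time works for spot checks. To keep group sizes visible for every combination at once, one option is to aggregate the median together with a count; the sketch below uses made-up strength values, and the minimum group size of 3 is an arbitrary threshold chosen for illustration.

```python
import pandas as pd

# Made-up rows: one Ghost/Dragon species and three Dragon/Flying species.
toy = pd.DataFrame({
    'type 1': ['Ghost', 'Dragon', 'Dragon', 'Dragon'],
    'type 2': ['Dragon', 'Flying', 'Flying', 'Flying'],
    'strength': [6.0, 3.0, 4.0, 5.0],
})

# Median strength and number of species for each type combination.
summary = toy.groupby(['type 1', 'type 2'])['strength'].agg(['median', 'count'])

# Keep only combinations backed by at least 3 species (arbitrary cutoff).
reliable = summary[summary['count'] >= 3]

print(summary)
print(reliable.index.tolist())
```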
There are more choices available with Dragon/Flying-type Pokemon as compared to Ghost/Dragon-type, not to mention three of them are non-legendary and relatively easier to capture. Therefore, one can see that determining Pokemon strength quickly boils down to being quite subjective. However, if there were a single best Pokemon or type of Pokemon, everyone would naturally gravitate towards it. By adding nuance to the system, the developers have made Pokemon a game that is more than simply optimizing statistics and strength, but a tactical and personal experience. No wonder they've sold nearly 300 million copies!
This exercise was a fun way to become familiar with pandas and seaborn, and I can already think of ways to use machine learning to answer questions such as:
- Can we identify a Pokemon's type solely by its photo?
- Can a machine classify the newly released Pokemon in generation 7?
If you want to play around with the code or run the analysis yourself, here's the GitHub repo. Don't hesitate to leave your comments below, especially if you can think of other interesting questions to answer using this dataset!