This article is part of our series about how different types of data scientists build similar models differently. No two humans are the same, and therefore no two data scientists are the same. Moreover, the circumstances under which a data challenge needs to be handled change constantly. For these reasons, different approaches can and will be used to complete the task at hand. In this series we explore the four different approaches of our data scientists – Meta Oric, Aki Razzi, Andy Stand, and Eqaan Librium. They have been given the task of building a model to predict whether employees of a company – STARDATAPEPS – will look for a new job or not. Based on their distinct profiles discussed in the first blog, you can already imagine that their approaches will be quite different.
In the previous blog we introduced our data science rock stars. In the next articles they will, all in their own way, predict which employees are most likely to leave the company.
But before we start this journey, let's take a quick look at what kind of information we actually have available for this quest. The source data is available here (https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists). In this notebook, we download it directly from Kaggle so you can reproduce everything we do in the next blogs. We also do some preparation of the data.
Loading the data directly from Kaggle
To get the dataset from Kaggle you need your Kaggle username and API token. These are included in the kaggle.json file, which you can download from your Kaggle account page. See https://www.kaggle.com/docs/api for details.
!pip install -q kaggle
Replace [YOUR_PERSONAL_KAGGLE_USER_NAME] and [YOUR_PERSONAL_KAGGLE_KEY] in the cell below with your own credentials and run it to download the data directly from Kaggle. Of course, it is also possible to download the files directly from the Kaggle website.
%sh
export KAGGLE_USERNAME=[YOUR_PERSONAL_KAGGLE_USER_NAME]
export KAGGLE_KEY=[YOUR_PERSONAL_KAGGLE_KEY]
kaggle datasets download arashnic/hr-analytics-job-change-of-data-scientists
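If you prefer to keep everything in Python rather than use a %sh cell, the same credentials can also be set as environment variables before the kaggle package is imported — a minimal sketch, with the bracketed values as placeholders for your personal credentials:

```python
import os

# Placeholder credentials: replace with your own values from kaggle.json.
# The kaggle package reads these environment variables when it is imported.
os.environ['KAGGLE_USERNAME'] = '[YOUR_PERSONAL_KAGGLE_USER_NAME]'
os.environ['KAGGLE_KEY'] = '[YOUR_PERSONAL_KAGGLE_KEY]'
```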
Now that we have downloaded the data, we can unzip it and explore it.
# import required modules
import zipfile
import pandas as pd

# open zipped dataset
with zipfile.ZipFile('/databricks/driver/hr-analytics-job-change-of-data-scientists.zip') as z:
    # open the csv file in the dataset
    with z.open("aug_train.csv") as f:
        # read the dataset
        df = pd.read_csv(f)

# display dataset
print(df.columns)
print(df.head())
Data preparation
We perform some preparation on the original data to meet our needs.
target
The column target indicates whether or not a data professional is open to changing jobs. This is also what our heroes will be predicting. Let's convert it to an integer before we go on.
#Convert target to integer:
df['target'] = df['target'].astype('int32')
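Note that astype('int32') only works if the column contains no missing values. A quick sanity check of the pattern on a toy frame (hypothetical values, mimicking the float-coded target):

```python
import pandas as pd

# Toy frame mimicking the target column, which is stored as float (0.0 / 1.0)
toy = pd.DataFrame({'target': [0.0, 1.0, 1.0, 0.0]})

# astype('int32') would fail on NaN, so verify there are no missing values first
assert toy['target'].isna().sum() == 0
toy['target'] = toy['target'].astype('int32')
print(toy['target'].tolist())  # [0, 1, 1, 0]
```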
city
Unfortunately, the city names are not included in this dataset. To be able to interpret model results, we've added them ourselves. We couldn't find a source that provides the names of US cities by their City Development Index. Therefore, we've ranked the cities in the data on the City Development Index and matched them with the Innovation Index dataset provided by StatsAmerica, which ranks metropolitan areas by a somewhat similar metric. It's unlikely that this yields the exact city name that goes with each city code in the data. Nevertheless, we prefer a real name in our examples over a city id, which is impossible to interpret.
# create dataframe with cities and a rank based on the city development index
cities = df.groupby('city')['city_development_index'].mean().reset_index()
cities['rank'] = cities['city_development_index'].rank(ascending=False,method='first').astype(int)
cities.head()
| | city | city_development_index | rank |
|---|---|---|---|
| 0 | city_1 | 0.847 | 33 |
| 1 | city_10 | 0.895 | 20 |
| 2 | city_100 | 0.887 | 23 |
| 3 | city_101 | 0.558 | 112 |
| 4 | city_102 | 0.804 | 43 |
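A quick look at how rank(ascending=False, method='first') behaves, on made-up development-index values for five hypothetical cities: the highest value gets rank 1, and method='first' breaks ties by order of appearance, which guarantees unique ranks for the join later on.

```python
import pandas as pd

# Made-up development-index values for five hypothetical cities
s = pd.Series([0.847, 0.895, 0.887, 0.558, 0.804])

# Highest value gets rank 1; method='first' resolves ties by position,
# so every city ends up with a unique rank
ranks = s.rank(ascending=False, method='first').astype(int)
print(ranks.tolist())  # [3, 1, 2, 5, 4]
```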
Next, we download the statsamerica.org city data. We use xlrd==1.2.0 to be able to load the xlsx data into a pandas dataframe.
!pip install xlrd==1.2.0
# download innovation index values of metropolitan areas and prepare for matching to cities
citydata = pd.read_excel('http://www.statsamerica.org/ii2/docs/downloads/Metros.xlsx', sheet_name='Data')
citydata = citydata[citydata.code_description=='Headline Index'][['description','index value']].rename(columns={'description':'city name'})
citydata['rank'] = citydata['index value'].rank(ascending=False,method='first').astype(int)
citydata = citydata[['rank','city name']]
citydata.head()
| | rank | city name |
|---|---|---|
| 0 | 230 | Abilene TX (Metro) |
| 78 | 134 | Akron OH (Metro) |
| 156 | 380 | Albany GA (Metro) |
| 234 | 317 | Albany OR (Metro) |
| 312 | 72 | Albany-Schenectady-Troy NY (Metro) |
Next, we add the city names based on the rank of the cities, both in our data as well as in the statsamerica.org data.
# create dataframe with city id and city name matched on index ranks and add city name to our data
city_name = cities.join(citydata.set_index('rank'), on='rank')[['city','city name']]
df = df.join(city_name.set_index('city'), on='city')
df[['city','city name','city_development_index']].head()
| | city | city name | city_development_index |
|---|---|---|---|
| 0 | city_103 | Denver-Aurora-Lakewood CO (Metro) | 0.920 |
| 1 | city_40 | Odessa TX (Metro) | 0.776 |
| 2 | city_21 | Auburn-Opelika AL (Metro) | 0.624 |
| 3 | city_115 | Corvallis OR (Metro) | 0.789 |
| 4 | city_162 | Tulsa OK (Metro) | 0.767 |
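The join trick in isolation: setting 'rank' as the index of the name table lets join match each city to the name with the same rank. A toy sketch with hypothetical mini versions of our two dataframes:

```python
import pandas as pd

# Hypothetical mini versions of `cities` and `citydata`
cities_toy = pd.DataFrame({'city': ['city_a', 'city_b'], 'rank': [2, 1]})
names_toy = pd.DataFrame({'rank': [1, 2], 'city name': ['Metro One', 'Metro Two']})

# join matches the 'rank' column against the index of names_toy
matched = cities_toy.join(names_toy.set_index('rank'), on='rank')
print(matched['city name'].tolist())  # ['Metro Two', 'Metro One']
```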
indicator for relevent_experience
The data also contains a textual feature indicating whether an employee has relevant work experience. Let’s turn that into a dummy (indicator) field right away:
df['relevent_experience'].value_counts()
#Convert relevent_experience to a dummy:
df['ind_relevent_experience'] = 0
df.loc[df['relevent_experience'] == 'Has relevent experience', 'ind_relevent_experience'] = 1
df.drop(columns=['relevent_experience'], inplace = True)
df['ind_relevent_experience'] = df['ind_relevent_experience'].astype('int32')
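The .loc pattern above, demonstrated on a hypothetical two-row frame; an equivalent one-liner that casts the boolean comparison directly is shown as well:

```python
import pandas as pd

# Hypothetical mini frame with the two values the column can take
toy = pd.DataFrame({'relevent_experience': ['Has relevent experience',
                                            'No relevent experience']})

# Same .loc pattern as above: default 0, set 1 where the text matches
toy['ind_relevent_experience'] = 0
toy.loc[toy['relevent_experience'] == 'Has relevent experience',
        'ind_relevent_experience'] = 1
print(toy['ind_relevent_experience'].tolist())  # [1, 0]

# Equivalent one-liner: cast the boolean comparison to an integer dummy
toy['ind2'] = (toy['relevent_experience'] == 'Has relevent experience').astype('int32')
assert toy['ind2'].tolist() == toy['ind_relevent_experience'].tolist()
```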
experience_num
Another feature describes the length of the working experience. This feature is currently textual, with the special values ‘<1’ and ‘>20’. To be able to use it as a numeric feature, we create a new feature, replacing ‘>20’ with 22 and ‘<1’ with 0.
#Creating a numeric version of experience:
df['experience_num'] = df.experience
df.experience_num.replace('>20','22',inplace=True) # replace special values ('>20', '<1') with numbers
df.experience_num.replace('<1','0',inplace=True)
df['experience_num'] = df['experience_num'].astype('float32')
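A quick check of the mapping on hypothetical values — ‘<1’ becomes 0, plain numbers stay themselves, and ‘>20’ becomes 22:

```python
import pandas as pd

# Hypothetical experience values covering both special cases
toy = pd.Series(['<1', '3', '>20'])
toy = toy.replace({'>20': '22', '<1': '0'}).astype('float32')
print(toy.tolist())  # [0.0, 3.0, 22.0]
```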
Check out our data and save
After these changes to the original dataset, the data we will work with in the next blogs looks like this:
df.head()
| | enrollee_id | city | city_development_index | gender | enrolled_university | education_level | major_discipline | experience | company_size | company_type | last_new_job | training_hours | target | city name | ind_relevent_experience | experience_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8949 | city_103 | 0.920 | Male | no_enrollment | Graduate | STEM | >20 | NaN | NaN | 1 | 36 | 1 | Denver-Aurora-Lakewood CO (Metro) | 1 | 22.0 |
| 1 | 29725 | city_40 | 0.776 | Male | no_enrollment | Graduate | STEM | 15 | 50-99 | Pvt Ltd | >4 | 47 | 0 | Odessa TX (Metro) | 0 | 15.0 |
| 2 | 11561 | city_21 | 0.624 | NaN | Full time course | Graduate | STEM | 5 | NaN | NaN | never | 83 | 0 | Auburn-Opelika AL (Metro) | 0 | 5.0 |
| 3 | 33241 | city_115 | 0.789 | NaN | NaN | Graduate | Business Degree | <1 | NaN | Pvt Ltd | never | 52 | 1 | Corvallis OR (Metro) | 0 | 0.0 |
| 4 | 666 | city_162 | 0.767 | Male | no_enrollment | Masters | STEM | >20 | 50-99 | Funded Startup | 4 | 8 | 0 | Tulsa OK (Metro) | 1 | 22.0 |
df.shape
This set has 19,158 records and various columns that will hopefully help to predict who is open to a job change. Most of the information relates to the education the data scientists have followed and the company where they currently work.
Let’s save the data for later use.
df.to_csv('df_prepared.csv')
For your convenience, we’ve also uploaded this data to our Azure blob storage, so you can immediately load the prepared data in our later blogs. You can load it like this:
df_prep = pd.read_csv('https://bhciaaablob.blob.core.windows.net/featurenegineeringfiles/df_prepared.csv')
df_prep.head()
| | Unnamed: 0 | enrollee_id | city | city_development_index | gender | enrolled_university | education_level | major_discipline | experience | company_size | company_type | last_new_job | training_hours | target | city name | ind_relevent_experience | experience_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 8949 | city_103 | 0.920 | Male | no_enrollment | Graduate | STEM | >20 | NaN | NaN | 1 | 36 | 1 | Denver-Aurora-Lakewood CO (Metro) | 1 | 22.0 |
| 1 | 1 | 29725 | city_40 | 0.776 | Male | no_enrollment | Graduate | STEM | 15 | 50-99 | Pvt Ltd | >4 | 47 | 0 | Odessa TX (Metro) | 0 | 15.0 |
| 2 | 2 | 11561 | city_21 | 0.624 | NaN | Full time course | Graduate | STEM | 5 | NaN | NaN | never | 83 | 0 | Auburn-Opelika AL (Metro) | 0 | 5.0 |
| 3 | 3 | 33241 | city_115 | 0.789 | NaN | NaN | Graduate | Business Degree | <1 | NaN | Pvt Ltd | never | 52 | 1 | Corvallis OR (Metro) | 0 | 0.0 |
| 4 | 4 | 666 | city_162 | 0.767 | Male | no_enrollment | Masters | STEM | >20 | 50-99 | Funded Startup | 4 | 8 | 0 | Tulsa OK (Metro) | 1 | 22.0 |
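The reloaded frame has an extra ‘Unnamed: 0’ column: to_csv writes the index by default, and read_csv brings it back as an unnamed column. A minimal sketch of how to restore the original shape, using an in-memory buffer instead of a real file:

```python
import io
import pandas as pd

# Write a tiny frame the same way (index included), then read it back
buf = io.StringIO()
pd.DataFrame({'a': [1, 2]}).to_csv(buf)
buf.seek(0)

# index_col=0 turns the written index back into the index instead of a column
restored = pd.read_csv(buf, index_col=0)
print(restored.columns.tolist())  # ['a']
```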
Now that we have introduced you to our data science team and the data, let the modeling begin!
Overview of links to blogs:
- Introducing our data science rock stars
- Data to predict which employees are likely to leave
- Good model by default using XGBoost
- Hyperparameter tuning for hyperaccurate XGBoost model
- Beat dirty data
- The case of high cardinality kerfuffles
- Guide to manage missing data
- Visualise the business value of predictive models
- No data scientist is the same!
All Python notebooks from this series are available on our Gitlab page.