Hey there!
In this article, I’m going to take you through a very simple Machine Learning project.
In this project, we’re going to predict the number of calories burnt by a person using features like Gender, Age, Height, Weight, Duration, Heart_Rate, and Body_Temp.
You can get the dataset using the link below.
Dataset: Calories Burnt Dataset
You will get a zip file after downloading the above dataset. After extraction, you will find two files, namely exercise.csv and calories.csv.
The features are in exercise.csv and the label is in calories.csv. See the picture below.
Copy the Calories column from the calories.csv file and paste it next to the Body_Temp column in the exercise.csv file, like the picture below.
Now save the exercise.csv file under a different name, let’s say caloriesburnt.csv.
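If you’d rather not copy and paste by hand, the same step can probably be done in pandas. Below is a minimal sketch, not the method used in this article. It assumes both files carry a User_ID column that identifies each row (check your files first). The function name combine_features_and_labels and the tiny stand-in tables are made up for illustration.

```python
import pandas as pd


def combine_features_and_labels(features: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    """Join the label column onto the feature table using the User_ID key.

    Assumption: both tables contain a User_ID column with one row per user.
    """
    # validate="one_to_one" raises an error if a User_ID is duplicated in either table
    return features.merge(labels, on="User_ID", how="inner", validate="one_to_one")


# Tiny stand-in for the features file
features = pd.DataFrame({
    "User_ID": [14733363, 14861698, 11179863],
    "Duration": [29, 14, 5],
    "Body_Temp": [40.8, 40.3, 38.7],
})

# Tiny stand-in for the label file, deliberately in a different row order
labels = pd.DataFrame({
    "User_ID": [11179863, 14733363, 14861698],
    "Calories": [26, 231, 66],
})

combined = combine_features_and_labels(features, labels)
print(combined)

# With the real files you would read each one with pd.read_csv(...),
# pass both frames to the function, and then save the result:
# combined.to_csv("caloriesburnt.csv", index=False)
```

Joining on the key instead of pasting columns side by side means the labels still line up even if the two files are sorted differently.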
Now we’re ready to go.
Upload the dataset into your Google Colab.
It’s time to code.
Let’s load our dataset into a data frame using pandas.
import pandas as pd
df = pd.read_csv("caloriesburnt.csv")
df.head()
"""
OUTPUT:
    User_ID  Gender  Age  Height  Weight  Duration  Heart_Rate  Body_Temp  Calories
0  14733363    male   68     190      94        29         105       40.8       231
1  14861698  female   20     166      60        14          94       40.3        66
2  11179863    male   69     179      79         5          88       38.7        26
3  16180408  female   34     179      71        13         100       40.5        71
4  17771927  female   27     154      58        10          81       39.8        35
"""
As we always do, let’s use some methods and attributes to understand our dataset. We need to know the shape of our dataset, the number of null values present in it, and more.
df.shape # (15000, 9)
df.isnull().sum()
"""
OUTPUT:
User_ID       0
Gender        0
Age           0
Height        0
Weight        0
Duration      0
Heart_Rate    0
Body_Temp     0
Calories      0
dtype: int64
"""
df.info()
"""
OUTPUT:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   User_ID     15000 non-null  int64
 1   Gender      15000 non-null  object
 2   Age         15000 non-null  int64
 3   Height      15000 non-null  int64
 4   Weight      15000 non-null  int64
 5   Duration    15000 non-null  int64
 6   Heart_Rate  15000 non-null  int64
 7   Body_Temp   15000 non-null  float64
 8   Calories    15000 non-null  int64
dtypes: float64(1), int64(7), object(1)
"""
df.describe()
"""
OUTPUT:
            User_ID           Age        Height        Weight      Duration    Heart_Rate     Body_Temp      Calories
count  1.500000e+04  15000.000000  15000.000000  15000.000000  15000.000000  15000.000000  15000.000000  15000.000000
mean   1.497736e+07     42.789800    174.465133     74.966867     15.530600     95.518533     40.025453     89.539533
std    2.872851e+06     16.980264     14.258114     15.035657      8.319203      9.583328      0.779230     62.456978
min    1.000116e+07     20.000000    123.000000     36.000000      1.000000     67.000000     37.100000      1.000000
25%    1.247419e+07     28.000000    164.000000     63.000000      8.000000     88.000000     39.600000     35.000000
50%    1.499728e+07     39.000000    175.000000     74.000000     16.000000     96.000000     40.200000     79.000000
75%    1.744928e+07     56.000000    185.000000     87.000000     23.000000    103.000000     40.600000    138.000000
max    1.999965e+07     79.000000    222.000000    132.000000     30.000000    128.000000     41.500000    314.000000
"""
So, there are a total of 15000 records in our dataset. I want to know how many of those 15000 are male and how many are female.
df['Gender'].value_counts()
"""
OUTPUT:
Gender
female    7553
male      7447
Name: count, dtype: int64
"""
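To put a number on how balanced the two groups are, the counts can be turned into shares with value_counts(normalize=True). This is a small sketch that rebuilds a Gender column from the counts printed above instead of loading the real file, so the Series here is a stand-in.

```python
import pandas as pd

# Stand-in column rebuilt from the counts shown above (7553 female, 7447 male)
gender = pd.Series(["female"] * 7553 + ["male"] * 7447, name="Gender")

# normalize=True returns fractions of the total instead of raw counts
shares = gender.value_counts(normalize=True)
print(shares.round(4))
```

On the real data frame the equivalent call would be df['Gender'].value_counts(normalize=True).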
The split between males and females is close to even, so the Gender column is well balanced.
Understanding the dataset through graphs and charts is far better than understanding it by looking at tables and numbers.
Let’s visualize some columns using Plotly.
import plotly.express as px
# Plotly sizes the chart through width and height, so no Matplotlib figure is needed
px.bar(df['Gender'].value_counts(), width=500, height=300)
(Are you coding with me?)
Let’s see the value distribution of the Age column.
px.histogram(df['Age'], width=1200, height=500, text_auto=True)
Using the width and height parameters we can adjust the size of our graph. Using the text_auto parameter we can see the count at the top of each bar. See the graph below.
Now, using the same method I used for the Age column, I want you to look at the distribution of the Height, Weight, Body_Temp, Duration, and Heart_Rate columns.
px.histogram(df['Height'], width=1000, height=500, text_auto=True)
px.histogram(df['Weight'], width=1000, height=500, text_auto=True)
px.histogram(df['Body_Temp'], width=1000, height=500, text_auto=True)
px.histogram(df['Duration'], width=1000, height=500, text_auto=True)
px.histogram(df['Heart_Rate'], width=1000, height=500, text_auto=True)
Let’s see which column in the dataset has the highest correlation with the label column, which is our Calories column.
When it comes to checking correlation, we only need the numerical columns. So, we can leave out the Gender column, and also the User_ID column, which is numeric but only an identifier.
df.iloc[:, 2:]
"""
OUTPUT:
       Age  Height  Weight  Duration  Heart_Rate  Body_Temp  Calories
0       68     190      94        29         105       40.8       231
1       20     166      60        14          94       40.3        66
2       69     179      79         5          88       38.7        26
3       34     179      71        13         100       40.5        71
4       27     154      58        10          81       39.8        35
...    ...     ...     ...       ...         ...        ...       ...
14995   20     193      86        11          92       40.4        45
14996   27     165      65         6          85       39.2        23
14997   43     159      58        16          90       40.1        75
14998   78     193      97         2          84       38.3        11
14999   63     173      79        18          92       40.5        98
15000 rows × 7 columns
"""
Now, let’s see the correlation using a heatmap.
px.imshow(df.iloc[:, 2:].corr())
The heatmap shows that Duration, Heart_Rate, and Body_Temp are the columns most strongly correlated with Calories burnt (which is obvious).
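A heatmap is easy to read by eye, but a sorted list can make the ranking explicit. Here is a minimal sketch on made-up data: the demo frame and the relationship between Duration and Calories in it are invented purely for illustration and are not the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

duration = rng.integers(1, 31, size=n)
age = rng.integers(20, 80, size=n)
# Hypothetical relationship, for illustration only: calories grow with duration plus noise
calories = 7 * duration + rng.normal(0, 10, size=n)

demo = pd.DataFrame({"Age": age, "Duration": duration, "Calories": calories})

# Take the Calories column of the correlation matrix, drop the self-correlation,
# and sort by absolute strength so the most related features come first
corr_with_target = (
    demo.corr()["Calories"]
    .drop("Calories")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_target)
```

On the real data frame the same idea would start from df.iloc[:, 2:].corr()['Calories'].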
We don’t have any null values in our dataset, and we have just one categorical column, which is Gender.
Let’s replace male with 0 and female with 1.
df.replace({'Gender':{'male':0, 'female':1}}, inplace=True)
df.head()
"""
OUTPUT:
    User_ID  Gender  Age  Height  Weight  Duration  Heart_Rate  Body_Temp  Calories
0  14733363       0   68     190      94        29         105       40.8       231
1  14861698       1   20     166      60        14          94       40.3        66
2  11179863       0   69     179      79         5          88       38.7        26
3  16180408       1   34     179      71        13         100       40.5        71
4  17771927       1   27     154      58        10          81       39.8        35
"""
Done.
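An alternative way to encode the column is Series.map, which makes it easy to catch labels you did not expect. This is a sketch on a small stand-in frame, not the code used in this article. The mapping dictionary mirrors the encoding above, and the error check is an extra safety step I am assuming you might want.

```python
import pandas as pd

# Small stand-in frame; the real data frame has many more columns
demo = pd.DataFrame({"Gender": ["male", "female", "female", "male"]})

mapping = {"male": 0, "female": 1}
encoded = demo["Gender"].map(mapping)

# map() produces NaN for any label missing from the mapping,
# so a quick check catches typos or unexpected categories
if encoded.isna().any():
    unknown = demo.loc[encoded.isna(), "Gender"].unique()
    raise ValueError(f"Unmapped Gender values: {unknown}")

demo["Gender"] = encoded.astype("int64")
print(demo)
```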
y = df['Calories']
X = df.drop(columns=['Calories', 'User_ID'])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
X_train.shape, X_test.shape # ((12000, 7), (3000, 7))
y_train.shape, y_test.shape # ((12000,), (3000,))
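If you are wondering what test_size and random_state actually do, here is a tiny sketch on made-up arrays: test_size=0.2 holds out 20% of the rows, and a fixed random_state gives the same split every time you run it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 50 rows with 2 features each
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

# First split with a fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)

# Second split with the same seed, to show the result is reproducible
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```

Changing random_state to another number would shuffle the rows differently, which is why scores can move slightly between runs that use different seeds.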
We’re going to build an XGBRegressor for this prediction.
from xgboost import XGBRegressor
xgbr = XGBRegressor()
xgbr.fit(X_train, y_train)
xgbr_prediction = xgbr.predict(X_test)
from sklearn.metrics import mean_absolute_error as mae, r2_score as r2
xgbr_mae = mae(y_test, xgbr_prediction)
xgbr_mae # 1.4849313759878278 (Wow, the error is extremely LOW)
xgbr_r2 = r2(y_test, xgbr_prediction)
xgbr_r2 # 0.9988308899957399 (Score is really GOOD)
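To see what these two numbers mean, here is a small sketch that computes them by hand with NumPy. MAE is the average absolute gap between actual and predicted values, and R² is one minus the ratio of the squared prediction errors to the squared deviations from the mean. The function names and the predicted values below are made up for illustration; they are not outputs of the model above.

```python
import numpy as np


def mae_manual(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error: average of |actual - predicted|."""
    return float(np.mean(np.abs(y_true - y_pred)))


def r2_manual(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """R squared: 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)


# Actual values taken from the first five rows shown earlier
y_true_demo = np.array([231.0, 66.0, 26.0, 71.0, 35.0])
# Made-up predictions, each off by 1 or 2 calories
y_pred_demo = np.array([229.0, 68.0, 25.0, 70.0, 37.0])

demo_mae = mae_manual(y_true_demo, y_pred_demo)
demo_r2 = r2_manual(y_true_demo, y_pred_demo)
print(demo_mae)  # the absolute errors are 2, 2, 1, 1, 2, so the mean is 1.6
print(demo_r2)
```

Read this way, an MAE of about 1.48 means the predictions are off by roughly one and a half calories on average, on a label that ranges from 1 to 314.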
Our model is performing fantastically.
Let’s also plot a graph that shows the difference between the actual and the predicted values.
px.scatter(x = y_test, y = xgbr_prediction, labels={'y':'Predicted Value', 'x':'Actual Value'}, title='Actual Value Vs Predicted Value')
The points fall almost on a straight line, which means our model predicts values very close to the actual values.
That’s it.
If you still haven’t grabbed my FREE E-Book, in which I’ve compiled all 30 articles I wrote during my 30-Days Machine Learning Project Challenge, use the link below to get it.
Also, don’t forget to connect with me on X.
X: AbbasAli_X