Predicting Car Price using Machine Learning

Tarique Akhtar
Towards Data Science
8 min read · Nov 23, 2020


In this post, we will work through linear regression and the real-world challenges that come up when implementing it for a business problem.

Photo by Joshua Koblin on Unsplash

Problem Description:

An automobile company, XYZ, from Japan aspires to enter the US market by setting up a manufacturing unit there and producing cars locally to compete with its US and European counterparts.

They want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Japanese market. Essentially, the company wants to know:

  • Which variables are significant in predicting the price of a car
  • How well those variables describe the price of a car

Based on various market surveys, a consulting firm has gathered a large dataset of different types of cars sold in the American market.

Business Objectives:

As a data scientist, you are required to model the price of cars using the available independent variables. This should help management understand exactly how prices vary with those variables, so they can adjust the design of the cars, the business strategy, etc. to meet certain price levels.

Dataset and Python code:

You can download the dataset and the corresponding Python code from my GitHub.

The solution is divided into the following sections:

  • Data understanding and exploration
  • Data cleaning
  • Data preparation
  • Model building and evaluation

Data understanding and exploration:

Summary of data: 205 rows, 26 columns, no null values

Image by author: 205 rows, 26 columns, no null values
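
For reference, a minimal sketch of how this summary can be reproduced with pandas (the filename CarPrice.csv is an assumption):

import pandas as pd

# load the dataset (filename is an assumption)
cars = pd.read_csv('CarPrice.csv')

# shape -> (205, 26); null counts per column -> all zeros
print(cars.shape)
print(cars.isnull().sum())
cars.info()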

The column “price” is the target variable, and the rest of the columns are independent variables.

The independent variables are again divided into Categorical and Numerical variables.

Numerical variables: [‘wheelbase’, ‘carlength’, ‘carwidth’, ‘carheight’, ‘curbweight’, ‘enginesize’, ‘boreratio’, ‘stroke’, ‘compressionratio’, ‘horsepower’, ‘peakrpm’, ‘citympg’, ‘highwaympg’]

Categorical variables: [‘symboling’, ‘fueltype’, ‘aspiration’, ‘doornumber’, ‘carbody’, ‘drivewheel’, ‘enginelocation’, ‘enginetype’, ‘cylindernumber’, ‘fuelsystem’, ‘car_name’]
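
A quick way to arrive at this split is to separate the columns by dtype; a sketch, assuming ‘symboling’ has already been cast to an object type and the target ‘price’ is excluded from the numerical list:

# numerical columns (excluding the target 'price')
numerical_cols = cars.select_dtypes(include=['int64', 'float64']).columns.drop('price')

# categorical columns
categorical_cols = cars.select_dtypes(include=['object']).columns

print(list(numerical_cols))
print(list(categorical_cols))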

Heatmap to show correlation of Numerical and Target variable:

Now let’s plot a heatmap, which is quite useful for visualising multiple correlations among numerical variables at once. We have also included the target variable “price” to understand how the numerical variables correlate with it.
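
A sketch of how such a heatmap can be produced with seaborn (figure size and colour map are assumptions, not taken from the original notebook):

import matplotlib.pyplot as plt
import seaborn as sns

# correlation matrix of the numerical variables plus the target 'price'
corr = cars[list(numerical_cols) + ['price']].corr()

plt.figure(figsize=(16, 10))
sns.heatmap(corr, annot=True, cmap='YlGnBu')
plt.show()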

Image by author: Heatmap to understand correlation with Target variable “Price”

The heatmap shows some useful insights:

Correlation of target variable “Price” with independent variables:

  • Price is highly (positively) correlated with wheelbase, carlength, carwidth, curbweight, enginesize, horsepower (notice how all of these variables represent the size/weight/engine power of the car)
  • Price is negatively correlated with ‘citympg’ and ‘highwaympg’ (approximately -0.70). This suggests that cars with high mileage tend to fall in the ‘economy’ category and are priced lower (think Maruti Alto/Swift type of cars, which are designed to be affordable for the middle class, who value mileage more than horsepower, size, etc.)

Correlation among independent variables:

  • Many independent variables are highly correlated (look at the top-left part of matrix): wheelbase, carlength, curbweight, enginesize etc. are all measures of ‘size/weight’, and are positively correlated

Thus, while building the model, we will have to pay attention to multicollinearity (linear models in particular, such as linear and logistic regression, suffer more from multicollinearity).

Data Cleaning:

We’ve seen that there are no missing values in the dataset.

We’ve also seen that the variables are in the correct format, except “symboling”, which should rather be a categorical variable (so that dummy variables are created for its categories).

We have also preprocessed the variable “CarName” and created a new variable called “car_company”.
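
A minimal sketch of these two cleaning steps (it assumes the company name is the first word of “CarName”; spelling variants such as “toyouta” vs “toyota” would still need manual fixes):

# treat 'symboling' as categorical so dummies are created for it later
cars['symboling'] = cars['symboling'].astype('object')

# derive 'car_company' as the first word of 'CarName', then drop 'CarName'
cars['car_company'] = cars['CarName'].apply(lambda name: name.split(' ')[0].lower())
cars = cars.drop('CarName', axis=1)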

Data Preparation:

Let’s now prepare the data for model building.

Split the data into X and y.

X = cars.loc[:, ['symboling', 'fueltype', 'aspiration', 'doornumber', 'carbody',
                 'drivewheel', 'enginelocation', 'wheelbase', 'carlength',
                 'carwidth', 'carheight', 'curbweight', 'enginetype',
                 'cylindernumber', 'enginesize', 'fuelsystem', 'boreratio',
                 'stroke', 'compressionratio', 'horsepower', 'peakrpm',
                 'citympg', 'highwaympg', 'car_company']]
y = cars['price']

Creating dummy variables for categorical variables.

# subset all categorical variables
cars_categorical = X.select_dtypes(include=['object'])
# convert into dummies
cars_dummies = pd.get_dummies(cars_categorical, drop_first=True)
# drop categorical variables
X = X.drop(list(cars_categorical.columns), axis=1)
# concat dummy variables with X
X = pd.concat([X, cars_dummies], axis=1)

Scaling the features and recovering the final list of columns in the dataframe for model building.

# scaling the features
from sklearn.preprocessing import scale
# storing column names in cols, since column names are (annoyingly) lost after
# scaling (the df is converted to a numpy array)
cols = X.columns
X = pd.DataFrame(scale(X))
X.columns = cols
X.columns
Image by author: final list of columns in dataframe for model building

Final Train-Test split of data.

# split into train and test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     train_size=0.7,
                                                     test_size=0.3,
                                                     random_state=100)

Model Building and Evaluation:

Building the first model with all the features

# instantiate
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
# fit on the training data
lm.fit(X_train, y_train)
# predict on the test data
y_pred = lm.predict(X_test)
# metrics
from sklearn.metrics import r2_score
print(r2_score(y_true=y_test, y_pred=y_pred))

R-squared = 0.83826213934

Not bad: we get an r-squared of approximately 0.84 on the test set using all the variables. Let’s see how well we can do with fewer features.

Let’s now build a model using recursive feature elimination (RFE) to select features. We’ll first start off with an arbitrary number of features, and then use the “statsmodels” library to build models on the shortlisted features (this is because sklearn doesn’t report adjusted r-squared, but statsmodels does).
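
For reference, adjusted r-squared penalises r-squared for the number of predictors; a small helper like the one below could compute it by hand (a sketch, not part of the original notebook):

def adjusted_r2_score(r2, n_samples, n_features):
    # adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)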

Choosing the optimal number of features for Model building:

One way to choose the optimal number of features is to plot the number of features (n_features) against adjusted r-squared, and then pick the best value of n_features.

from sklearn.feature_selection import RFE
import statsmodels.api as sm
import matplotlib.pyplot as plt

n_features_list = list(range(4, 20))
adjusted_r2 = []
r2 = []
test_r2 = []

for n_features in n_features_list:
    # RFE with n features
    lm = LinearRegression()
    rfe_n = RFE(lm, n_features_to_select=n_features)
    rfe_n.fit(X_train, y_train)

    # subset the features selected by RFE
    col_n = X_train.columns[rfe_n.support_]
    X_train_rfe_n = X_train[col_n]

    # add a constant and fit an OLS model with the n selected variables
    X_train_rfe_n = sm.add_constant(X_train_rfe_n)
    lm_n = sm.OLS(y_train, X_train_rfe_n).fit()
    adjusted_r2.append(lm_n.rsquared_adj)
    r2.append(lm_n.rsquared)

    # make predictions on the test set using the same n features
    X_test_rfe_n = X_test[col_n]
    X_test_rfe_n = sm.add_constant(X_test_rfe_n, has_constant='add')
    y_pred = lm_n.predict(X_test_rfe_n)
    test_r2.append(r2_score(y_test, y_pred))

# plotting train r2, adjusted r2 and test r2 against n_features
plt.figure(figsize=(10, 8))
plt.plot(n_features_list, adjusted_r2, label="adjusted_r2")
plt.plot(n_features_list, r2, label="train_r2")
plt.plot(n_features_list, test_r2, label="test_r2")
plt.legend(loc='upper left')
plt.show()
Image by author: Number of Features vs R-squared

Based on the plot, we can choose the number of features considering the r2_score we are looking for. Note that there are a few caveats in this approach, and there are more sophisticated techniques to choose the optimal number of features:

  • Cross-validation: in this case, we have considered only one train-test split of the dataset; the values of r-squared and adjusted r-squared will vary with the train-test split. Cross-validation is therefore the more commonly used technique (you divide the data into multiple train-test ‘folds’ and then compute average metrics such as r-squared across the folds); see the sketch after this list.
  • The values of r-squared and adjusted r-squared are computed on the training set, though we should always look at metrics computed on the test set. For example, in this case the test r2 actually goes down with increasing n; this phenomenon is called ‘overfitting’, where performance on the training set is good because the model has in some way ‘memorised’ the dataset, and so performance on the test set is worse.
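
A minimal sketch of the cross-validation idea mentioned above, using 5 folds and a fixed feature count of 6 (both are assumptions for illustration):

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# pipeline: RFE selecting 6 features, then linear regression
pipe = make_pipeline(RFE(LinearRegression(), n_features_to_select=6),
                     LinearRegression())

# average r-squared across 5 folds of the training data
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='r2')
print(scores.mean(), scores.std())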

Thus, we can choose anything between 4 and 12 features, since beyond 12 the test r2 goes down, and below 4 the r2_score is too low.

In fact, the test_r2 score doesn’t increase much anyway from n=6 to n=12. It is thus wiser to choose a simpler model, and so let’s choose n=6.

Final Model:

Let’s now build the final model with 6 features.

# RFE with 6 features
n_features = 6
lm = LinearRegression()
rfe_n = RFE(lm, n_features_to_select=n_features)
rfe_n.fit(X_train, y_train)

# subset the features selected by RFE
col_n = X_train.columns[rfe_n.support_]
X_train_rfe_n = X_train[col_n]

# add a constant and fit the OLS model with the 6 selected variables
X_train_rfe_n = sm.add_constant(X_train_rfe_n)
lm_n = sm.OLS(y_train, X_train_rfe_n).fit()

# make predictions on the test set using the final model
X_test_rfe_n = X_test[col_n]
X_test_rfe_n = sm.add_constant(X_test_rfe_n, has_constant='add')
y_pred = lm_n.predict(X_test_rfe_n)

# summary of the fitted model
lm_n.summary()
Image by author: OLS Regression result
# results 
r2_score(y_test, y_pred)

R-squared = 0.88514228773125714

So the model achieves an r-squared of about 0.885 (88.5%) on the test data, which is good. There are other ways to evaluate the model as well; let’s look at them next.

Final Model Evaluation:

Let’s now evaluate the model in terms of its assumptions. We should test that:

  • The error terms are normally distributed with mean approximately 0.
  • There is little correlation between the predictors.
  • Homoscedasticity, i.e. the ‘spread’ or ‘variance’ of the error term (y_true-y_pred) is constant.
Plotting the error terms against the index:

# Error terms
c = [i for i in range(len(y_pred))]
fig = plt.figure()
plt.plot(c,y_test-y_pred, color="blue", linewidth=2.5, linestyle="-")
fig.suptitle('Error Terms', fontsize=20) # Plot heading
plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('ytest-ypred', fontsize=16) # Y-label
plt.show()
Image by author: Error Terms

Plotting the distribution of the error terms to check that it is roughly normal with a mean close to 0.

fig = plt.figure()
# distribution of the residuals (histplot replaces the deprecated distplot)
sns.histplot((y_test - y_pred), bins=50, kde=True)
fig.suptitle('Error Terms', fontsize=20)   # Plot heading
plt.xlabel('y_test - y_pred', fontsize=18) # X-label
plt.ylabel('Count', fontsize=16)           # Y-label
plt.show()
Image by author: Error distribution

Now it may look like the mean is not exactly 0; however, compared to the scale of ‘price’, a mean error of around -380 is not a big number (see the price distribution below).

Image by author: Price Distribution

Multicollinearity:

predictors = ['carwidth', 'curbweight', 'enginesize',
              'enginelocation_rear', 'car_company_bmw', 'car_company_porsche']

cors = X.loc[:, predictors].corr()
sns.heatmap(cors, annot=True)
plt.show()
Image by author: Heatmap to show multicollinearity among Predictors
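
Beyond eyeballing the correlation heatmap, multicollinearity can also be quantified with variance inflation factors (VIFs); a sketch using statsmodels (not part of the original notebook; values above roughly 5-10 are usually considered problematic):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each of the final predictors (constant added for a proper design matrix)
X_vif = sm.add_constant(X[predictors])
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vif)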

Conclusion:

Though this is the simplest model we’ve built so far, the final predictors still seem to have high correlations. One could go ahead and remove some of these features, though that would affect the adjusted r-squared score significantly (you should try doing that).

Thus, for now, the final model consists of the 6 variables mentioned above.

Thank you for reading!
