Medical Cost Prediction Using Linear Regression | Nandita Pore
Introduction:
Medical costs can often be a major concern for individuals and families. Predicting these costs can have significant implications for insurance companies, healthcare providers, and policy makers. In this blog post, we will delve into the world of medical cost prediction using linear regression. We’ll walk you through the entire process, from loading the dataset to evaluating the model’s performance, all while providing code examples at each step.
Dataset Link: Medical Cost Dataset
Notebook Link: Medical Cost Prediction using Linear Regression
LinkedIn Profile: Nandita Pore
Kaggle Profile: Nandita Pore
1. Importing Essential Libraries
Let’s start by importing the libraries we’ll use throughout the analysis.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
Here’s a concise explanation of each library:
Pandas (import pandas as pd): Pandas helps manage and analyze structured data efficiently. It’s used to load, explore, and preprocess datasets, presenting data in tabular form.
Matplotlib (import matplotlib.pyplot as plt): Matplotlib supports creating various types of graphs and charts. It’s employed here for visualizing relationships between variables, like age and medical charges.
NumPy (import numpy as np): NumPy is essential for numerical computations and array operations. It’s used to generate arrays, perform calculations, and handle numerical data.
Scikit-Learn (from sklearn.model_selection import train_test_split): Scikit-Learn provides tools for machine learning tasks. The train_test_split function splits data into training and testing sets, aiding model evaluation.
Scikit-Learn’s LinearRegression (from sklearn.linear_model import LinearRegression): Scikit-Learn’s LinearRegression class enables the creation and training of linear regression models. It’s used to build the predictive model for medical costs.
Scikit-Learn’s Metrics (from sklearn.metrics import mean_squared_error, r2_score): Scikit-Learn’s metrics offer ways to measure model performance. mean_squared_error and r2_score assess how well the model’s predictions align with actual data.
These libraries collectively empower us to handle data, create models, visualize insights, and evaluate results, making the journey of predicting medical costs using linear regression both efficient and effective.
2. Loading the Dataset
df = pd.read_csv('/kaggle/input/medical-cost-dataset/medical_cost.csv')
pd.read_csv(...): This Pandas function reads data from a CSV file.
'/kaggle/input/medical-cost-dataset/medical_cost.csv': This is the file path to the CSV dataset file.
df: This is the Pandas DataFrame variable where the loaded data is stored for analysis and manipulation.
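Before preprocessing, it helps to confirm what was actually loaded. The following is a minimal sketch of that quick look; the exact columns and row count depend on the CSV file, but the steps below assume columns named age, bmi, children, smoker, and charges.
# Take a quick look at the loaded data
print(df.shape)           # number of rows and columns
print(df.head())          # first few rows
df.info()                 # column names, data types, and non-null counts
print(df.isnull().sum())  # missing values per column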
3. Preprocessing the Dataset
Before we can use the dataset to build a linear regression model, we need to preprocess it. Here we convert the categorical smoker column into numerical values (binary encoding) and split the data into features (X) and a target (y).
# Convert 'smoker' column to numerical values (binary encoding)
df['smoker'] = df['smoker'].map({'yes': 1, 'no': 0})
# Prepare the data
X = df[['age', 'bmi', 'children', 'smoker']]
y = df['charges']
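The map call above handles the binary smoker column. If your copy of the dataset also contains other categorical columns (for example sex or region, which are assumed names here and are not used by this model), a hedged sketch of one-hot encoding them with Pandas looks like this:
# Optional: one-hot encode other categorical columns, if present
# 'sex' and 'region' are assumed column names, not used in the model below
categorical_cols = [c for c in ['sex', 'region'] if c in df.columns]
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)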
4. Splitting the Data
Next, we’ll split the dataset into training and testing sets. This allows us to train the model on one portion of the data and evaluate its performance on another.
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
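As a quick sanity check, you can confirm that the split produced roughly an 80/20 division of the rows:
# Confirm the sizes of the training and testing sets
print(X_train.shape, X_test.shape)  # about 80% / 20% of the rows
print(y_train.shape, y_test.shape)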
5. Building and Training the Linear Regression Model
Now, it’s time to create the linear regression model and train it using the training data.
from sklearn.linear_model import LinearRegression
# Create a linear regression model
model = LinearRegression()
# Train the model on the training data
model.fit(X_train, y_train)
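Once the model is fitted, its learned parameters are exposed through the coef_ and intercept_ attributes. The short sketch below simply prints them so you can see how each feature contributes to the predicted charges:
# Inspect the learned coefficients and intercept
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef:.2f}")
print("Intercept:", round(model.intercept_, 2))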
6. Predicting the Charges
Let’s predict medical charges across the range of ages in the dataset while holding the other features (BMI, children, smoker) fixed at their mean values.
# Predict charges based on the age variable
age_range = np.linspace(min(X['age']), max(X['age']), 100).reshape(-1, 1)
constant_vars = np.mean(X[['bmi', 'children', 'smoker']], axis=0).values.reshape(1, -1)
X_constant = np.hstack((age_range, np.tile(constant_vars, (age_range.shape[0], 1))))
predicted_charges = model.predict(X_constant)
age_range: Creates an array of 100 evenly spaced values within the range of ages in the dataset, using np.linspace(...) and reshape(-1, 1).
constant_vars: Calculates the mean values of the 'bmi', 'children', and 'smoker' variables, which are held constant for the prediction. The result is reshaped into a single row using np.mean(...) and reshape(...).
X_constant: Combines age_range with the repeated constant_vars to form a matrix of input features for prediction, using np.hstack(...) and np.tile(...). (A DataFrame-based variant is shown after this list.)
predicted_charges: Uses the trained linear regression model (model) to predict medical charges from the combined input matrix (X_constant).
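Because the model was fitted on a Pandas DataFrame with named columns, passing a plain NumPy array such as X_constant may trigger a scikit-learn warning about missing feature names. A minimal variant that keeps the column names (assuming the same four features) is:
# Rebuild the prediction grid as a DataFrame to preserve feature names
X_constant_df = pd.DataFrame(X_constant, columns=['age', 'bmi', 'children', 'smoker'])
predicted_charges = model.predict(X_constant_df)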
7. Visualizing Results
We’ll visualize the actual charges from the test set as a scatter plot and overlay the predicted line for age.
# Create a scatter plot of actual vs. predicted charges
plt.figure(figsize=(10, 6))
plt.scatter(X_test['age'], y_test, label="Actual Charges", alpha=0.7)
plt.plot(age_range, predicted_charges, color='red', label="Predicted Line")
plt.xlabel("Age")
plt.ylabel("Charges")
plt.title("Actual vs. Predicted Charges")
plt.legend()
plt.show()
8. Evaluating the Model
To assess the model’s performance, we’ll calculate the Mean Squared Error (MSE) and the R-squared value.
# Evaluate the model
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)
Mean Squared Error: 33981653.95019776
R-squared: 0.7811147722517886
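Since MSE is expressed in squared units, its square root (the root mean squared error, RMSE) is often easier to interpret because it is in the same dollar units as the charges:
# RMSE is in the same units as the target variable (dollars)
rmse = np.sqrt(mse)
print("Root Mean Squared Error:", rmse)
With the MSE reported above, this works out to roughly 5,829, i.e. a typical prediction error on the order of $5,800.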
In this analysis of predicting medical costs using linear regression, the trained model demonstrated promising performance in capturing the relationship between the selected features (age, BMI, number of children, and smoking status) and medical charges. The model's predictive power was evaluated using the Mean Squared Error (MSE) and R-squared metrics.
The Mean Squared Error (MSE) of approximately 33,981,654 is the average squared difference between the predicted and actual medical charges. Lower MSE values indicate better predictive accuracy; on its own the raw MSE is hard to judge, but the corresponding RMSE of roughly 5,829 computed above puts the typical prediction error in the same dollar units as the charges.
The R-squared value of approximately 0.781 means that around 78.1% of the variance in medical charges is explained by the model's four features together, not by age alone. R-squared measures how well the model fits the data, with values closer to 1 indicating a better fit, so 0.781 indicates a reasonably strong fit.
In conclusion, this linear regression analysis shows that age, BMI, number of children, and smoking habits together explain a large share of the variation in medical charges, as reflected in the MSE and R-squared values. However, further refinement and the consideration of additional variables may lead to even more accurate predictions. This analysis lays the groundwork for future explorations into more comprehensive predictive models for medical costs.