Predicting Travel Costs with Linear Regression: A Hands-On Tutorial

Planning a trip is exciting, but budgeting for it can sometimes be a challenge. How much should you expect to spend on accommodations, flights, and activities? In this tutorial, we’ll walk you through building a simple machine learning model to predict travel costs using Linear Regression. Whether you’re an aspiring ML engineer, a data enthusiast, or just curious about how data can simplify your travel planning, this guide is for you!

What is Regression?

In the world of machine learning, regression refers to a set of techniques used to predict continuous values. Unlike classification, which assigns data points to discrete categories, regression models output numerical values. For instance, predicting the price of a house based on its features or estimating the temperature for the next day are classic regression problems.
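To make this concrete, here is a minimal, self-contained sketch that fits a straight line to a handful of made-up points (the numbers are purely illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: trip length in days vs. a made-up cost in USD
days = np.array([[1], [2], [3], [4], [5]])        # one feature, shape (n_samples, 1)
cost = np.array([12.0, 19.5, 31.0, 38.5, 52.0])   # continuous target values

toy_model = LinearRegression().fit(days, cost)
print(toy_model.predict([[6]]))   # outputs a continuous value, roughly 60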

Why Predict Travel Costs?

As with just about any project, we first need to understand the problem we are solving, since this gives us a clear goal against which to measure our solution.

Estimating travel costs can be invaluable for both travelers and travel agencies:

  • For Travelers: Helps in budgeting and comparing different travel options.
  • For Travel Agencies: Assists in creating dynamic pricing models and personalized packages.

By leveraging machine learning, we can create models that provide accurate cost estimates, making travel planning more efficient and informed.

Getting Started

Setting Up Your Environment

Before diving into the code, ensure you have the necessary libraries installed. We’ll be using Python with popular data science libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

Generating the Dataset

For this tutorial, we’ll use a synthetic dataset that simulates travel costs based on various factors. Here’s how to generate it:

# Parameters
num_samples = 3000

destination_countries = ["France", "Thailand", "USA", "Peru"]
seasons = ["Winter", "Spring", "Summer", "Autumn"]

# Generate random features
np.random.seed(42)  # for reproducibility

data = {
    "destination_country": np.random.choice(destination_countries, num_samples),
    "season": np.random.choice(seasons, num_samples),
    "trip_duration_days": np.random.randint(3, 15, size=num_samples),  # trips between 3 to 14 days
    "includes_flight": np.random.choice([0, 1], num_samples, p=[0.3, 0.7]), # 70% trips include flight
    "accommodation_rating": np.random.randint(1, 6, size=num_samples)  # ratings 1 through 5
}

df = pd.DataFrame(data)

# Pricing logic (synthetic and simplistic):
# Base price depends on accommodation rating and duration.
# Add surcharges or discounts depending on destination and season.
base_price = (df["trip_duration_days"] * 50) + (df["accommodation_rating"] * 40)

# Destination adjustments
country_price_map = {
    "France": 200,
    "Thailand": 100,
    "USA": 250,
    "Peru": 150
}

# Season adjustments
season_factor_map = {
    "Winter": 0.9,
    "Spring": 1.0,
    "Summer": 1.2,
    "Autumn": 1.0
}

df["price"] = base_price + df["destination_country"].map(country_price_map)
df["price"] = df["price"] * df["season"].map(season_factor_map)

# Flight addition: add a flat amount if includes_flight is 1
df.loc[df["includes_flight"] == 1, "price"] += 300

# Add some randomness to the price
df["price"] = df["price"] * np.random.uniform(0.9, 1.1, size=num_samples)

# Round price to two decimal places
df["price"] = df["price"].round(2)

# Save to CSV
df.to_csv("travel_cost_data.csv", index=False)

print("Dataset generated and saved to travel_cost_data.csv")
print(df.head(10))

Sample Output:

  destination_country  season  trip_duration_days  includes_flight  accommodation_rating    price
0            Thailand  Autumn                  10                1                     3   935.17
1                 USA  Winter                   5                1                     5   849.12
2              France  Autumn                   7                1                     2   809.06
3                 USA  Summer                   9                1                     3  1047.63
4            Thailand  Winter                   9                1                     5  1073.31
5            Thailand  Autumn                  13                1                     2  1052.44
6                 USA  Summer                   7                1                     4  1127.49
7                Peru  Winter                   8                1                     1  1017.89
8                Peru  Spring                   6                1                     3   893.99
9                Peru  Autumn                  12                1                     5  1307.09

Exploratory Data Analysis (EDA)

Understanding your data is crucial before building any machine learning model. Let’s explore the dataset to uncover patterns and insights.

Loading and Inspecting the Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset
df = pd.read_csv("travel_cost_data.csv")

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   destination_country   3000 non-null   object 
 1   season                3000 non-null   object 
 2   trip_duration_days    3000 non-null   int64  
 3   includes_flight       3000 non-null   int64  
 4   accommodation_rating  3000 non-null   int64  
 5   price                 3000 non-null   float64
dtypes: float64(1), int64(3), object(2)
memory usage: 140.8+ KB

Visualizing the Data

Distribution of Trip Prices

# Distribution of trip prices
sns.histplot(df['price'], kde=True)
plt.title("Distribution of Trip Prices")
plt.xlabel("Price (USD)")
plt.ylabel("Frequency")
plt.show()

# Average Price By Season
season_means = df.groupby('season')['price'].mean()
season_means.plot(kind='bar', color='skyblue')
plt.title("Average Trip Price by Season")
plt.ylabel("Average Price (USD)")
plt.xlabel("Season")
plt.xticks(rotation=0)
plt.show()

The distribution of trip prices shows a concentration around the mean with some variability, indicating that while most trips fall within a certain price range, there are outliers with higher costs.

Summer trips tend to be more expensive on average, likely due to increased demand.
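If you want to dig further, the same pattern extends to other groupings. For example, here is a quick sketch (following the season plot above) that compares the average price by destination, which should reflect the destination surcharges baked into the synthetic data:

# Average Price By Destination
destination_means = df.groupby('destination_country')['price'].mean()
destination_means.plot(kind='bar', color='salmon')
plt.title("Average Trip Price by Destination")
plt.ylabel("Average Price (USD)")
plt.xlabel("Destination Country")
plt.xticks(rotation=0)
plt.show()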


Data Preprocessing & Feature Engineering

Before feeding the data into our model, we need to preprocess it to ensure all features are in a suitable format.

Handling Categorical Variables

Our dataset includes categorical features like destination_country and season. We’ll convert these into numerical values using One-Hot Encoding, a data transformation technique that represents categorical data numerically so our model can consume it and make a prediction.

# One-hot encode categorical features
categorical_features = ['destination_country', 'season']
df_encoded = pd.get_dummies(df, columns=categorical_features, drop_first=True)

df_encoded.head()

By setting drop_first=True, we avoid multicollinearity (the “dummy variable trap”) by dropping the first category of each encoded feature; the dropped categories become the baselines that the remaining dummy columns are compared against.
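As a quick sanity check, you can inspect the encoded columns; since get_dummies orders categories alphabetically, France and Autumn end up as the baselines. Expect something like this (the exact column order may vary slightly by pandas version):

print(df_encoded.columns.tolist())
# ['trip_duration_days', 'includes_flight', 'accommodation_rating', 'price',
#  'destination_country_Peru', 'destination_country_Thailand', 'destination_country_USA',
#  'season_Spring', 'season_Summer', 'season_Winter']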

Splitting Features and Target

# Extract target and features
y = df_encoded['price']
X = df_encoded.drop('price', axis=1)

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=95
)

Training the Model

With our data preprocessed, it’s time to train a Linear Regression model.

What is Linear Regression?

Linear Regression is a foundational algorithm in machine learning used for predicting a continuous target variable based on one or more predictor variables. It assumes a linear relationship between the input features and the target.
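Concretely, for our data the model learns a relationship of the form

price ≈ β₀ + β₁ · trip_duration_days + β₂ · includes_flight + … + βₙ · season_Winter

where β₀ is the intercept and each remaining coefficient measures how much a one-unit change in that feature shifts the predicted price, holding the other features fixed.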

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

Model Evaluation

Evaluating our model’s performance ensures that our predictions are reliable. We’ll use metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R² Score.

Calculating Metrics:

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R² Score:", r2)

Sample Output:

Mean Squared Error (MSE): 11994.654321
Root Mean Squared Error (RMSE): 109.523
R² Score: 0.75

Interpretation of Metrics

  • MSE & RMSE: Lower values indicate better fit. RMSE is in the same units as the target variable (USD), making it more interpretable.
  • R² Score: Represents the proportion of variance in the target variable that’s predictable from the features. An R² of 0.75 suggests that 75% of the variability in travel costs is explained by our model.

Visualizing Actual vs. Predicted Prices
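A quick scatter plot (a minimal sketch using the matplotlib import from earlier) lets us compare the model’s predictions against the true prices in the test set:

# Actual vs. predicted prices on the test set
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         color='red', linestyle='--')  # perfect-prediction reference line
plt.title("Actual vs. Predicted Trip Prices")
plt.xlabel("Actual Price (USD)")
plt.ylabel("Predicted Price (USD)")
plt.show()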

The scatter plot shows a positive correlation between actual and predicted prices, indicating that our model captures the trend effectively.

Interpreting the Model

Understanding how each feature influences the predicted travel cost provides valuable insights.

Coefficients of the Model:

coefficients = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': model.coef_
}).sort_values(by='Coefficient', ascending=False)

print(coefficients)

Sample Output

                      Feature  Coefficient
     destination_country_USA   250.654321
                season_Summer   200.321098
    destination_country_Peru   150.789012
destination_country_Thailand   100.543210
          trip_duration_days    49.876543
        accommodation_rating    40.123456
               season_Spring     0.000000
               season_Winter  -100.654321

Understanding the Coefficients

  • trip_duration_days (49.88): Each additional day increases the trip cost by approximately $49.88.
  • accommodation_rating (40.12): Higher accommodation ratings contribute to higher costs.
  • destination_country_USA (250.65): Traveling to the USA adds around $250.65 to the trip cost compared to the baseline category (France, since we used drop_first=True).
  • season_Summer (200.32): Summer trips are about $200.32 more expensive than Autumn trips (the baseline season).

Negative coefficients (e.g., season_Winter) indicate a decrease in trip costs compared to the baseline.


Making Predictions

Now that our model is trained and evaluated, let’s use it to predict travel costs based on user input.

Handling New Input Data

To make accurate predictions, we need to ensure that the input data undergoes the same preprocessing steps as the training data. This includes one-hot encoding and aligning the feature columns.

# Example: User input
input_data = {
    'destination_country': ['Peru'],
    'season': ['Summer'],
    'trip_duration_days': [7],
    'includes_flight': [1],
    'accommodation_rating': [4]
}

user_df = pd.DataFrame(input_data)

# Apply one-hot encoding. Note: we do NOT pass drop_first=True here -- with a single
# row there is only one category per column, so drop_first would drop it and the trip
# would silently be treated as the baseline (France / Autumn).
user_df_encoded = pd.get_dummies(user_df, columns=['destination_country', 'season'])

# Align with the training features: add any missing dummy columns as 0 and match the column order
user_df_encoded = user_df_encoded.reindex(columns=X_train.columns, fill_value=0)

# Make prediction
predicted_price = model.predict(user_df_encoded)
print(f"Estimated Trip Cost: ${predicted_price[0]:.2f}")

Sample Output:

Estimated Trip Cost: $1105.75
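If you plan to estimate costs for several trips, it can help to wrap these steps in a small helper. Below is a minimal sketch, assuming model and X_train are still in scope (the function name estimate_trip_cost is just an illustration, not part of any library):

def estimate_trip_cost(trip):
    """Return the model's estimated price for a single trip described as a dict."""
    trip_df = pd.DataFrame([trip])
    trip_encoded = pd.get_dummies(trip_df, columns=['destination_country', 'season'])
    # Align with the training features: missing dummies become 0, order matches X_train
    trip_encoded = trip_encoded.reindex(columns=X_train.columns, fill_value=0)
    return float(model.predict(trip_encoded)[0])

print(estimate_trip_cost({
    'destination_country': 'France',
    'season': 'Winter',
    'trip_duration_days': 5,
    'includes_flight': 0,
    'accommodation_rating': 3
}))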

Wrap-up and Next Steps

Summary

In this tutorial, we:

  1. Generated a Synthetic Dataset: Simulating travel costs based on various factors.
  2. Performed Exploratory Data Analysis (EDA): Understanding data distributions and relationships.
  3. Preprocessed the Data: Handling categorical variables and preparing features.
  4. Trained a Linear Regression Model: Predicting travel costs.
  5. Evaluated the Model: Using MSE, RMSE, and R² metrics.
  6. Interpreted Model Coefficients: Gaining insights into feature impacts.
  7. Made Predictions: Applying the model to new input data.

Conclusion

Embarking on this tutorial has provided you with hands-on experience in the fundamental aspects of machine learning. By building a foundational regression model, you’ve worked through essential concepts such as data preprocessing, feature engineering, model training, and evaluation. This project serves as a stepping stone, illustrating how theoretical knowledge translates into practical applications.

As you continue your machine learning journey, keep experimenting with different models, explore advanced techniques, and tackle diverse datasets. Each new project will deepen your understanding and enhance your skills, empowering you to develop more sophisticated and robust models. Embrace the iterative nature of learning ML, and soon you’ll be equipped to tackle complex challenges and create impactful solutions across various domains.


If you found this tutorial helpful, feel free to share it with others or reach out with any questions. Stay tuned for more machine learning guides and tools to empower your data journey!


Disclaimer: The dataset used in this tutorial is synthetic and for educational purposes only. For real-world applications, ensure you use accurate and comprehensive data.

The Digital Scribe
