Planning a trip is exciting, but budgeting for it can sometimes be a challenge. How much should you expect to spend on accommodations, flights, and activities? In this tutorial, we’ll walk you through building a simple machine learning model to predict travel costs using Linear Regression. Whether you’re an aspiring ML engineer, a data enthusiast, or just curious about how data can simplify your travel planning, this guide is for you!
What is Regression?
In the world of machine learning, regression refers to a set of techniques used to predict continuous values. Unlike classification, which assigns data points to discrete categories, regression models output numerical values. For instance, predicting the price of a house based on its features or estimating the temperature for the next day are classic regression problems.
Why Predict Travel Costs?
As with any project, we should first understand the problem we are solving, since that gives us a clear goal and a way to measure our solution.
Estimating travel costs can be invaluable for both travelers and travel agencies:
- For Travelers: Helps in budgeting and comparing different travel options.
- For Travel Agencies: Assists in creating dynamic pricing models and personalized packages.
By leveraging machine learning, we can create models that provide accurate cost estimates, making travel planning more efficient and informed.
Getting Started
Setting Up Your Environment
Before diving into the code, make sure the necessary libraries are installed: pandas, NumPy, Matplotlib, seaborn, and scikit-learn. We’ll be using Python with these popular data science libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
Generating the Dataset
For this tutorial, we’ll use a synthetic dataset that simulates travel costs based on various factors. Here’s how to generate it:
# Parameters
num_samples = 300
destination_countries = ["France", "Thailand", "USA", "Peru"]
seasons = ["Winter", "Spring", "Summer", "Autumn"]
# Generate random features
np.random.seed(42) # for reproducibility
data = {
"destination_country": np.random.choice(destination_countries, num_samples),
"season": np.random.choice(seasons, num_samples),
"trip_duration_days": np.random.randint(3, 15, size=num_samples), # trips between 3 to 14 days
"includes_flight": np.random.choice([0, 1], num_samples, p=[0.3, 0.7]), # 70% trips include flight
"accommodation_rating": np.random.randint(1, 6, size=num_samples) # ratings 1 through 5
}
df = pd.DataFrame(data)
# Pricing logic (synthetic and simplistic):
# Base price depends on accommodation rating and duration.
# Add surcharges or discounts depending on destination and season.
base_price = (df["trip_duration_days"] * 50) + (df["accommodation_rating"] * 40)
# Destination adjustments
country_price_map = {
"France": 200,
"Thailand": 100,
"USA": 250,
"Peru": 150
}
# Season adjustments
season_factor_map = {
"Winter": 0.9,
"Spring": 1.0,
"Summer": 1.2,
"Autumn": 1.0
}
df["price"] = base_price + df["destination_country"].map(country_price_map)
df["price"] = df["price"] * df["season"].map(season_factor_map)
# Flight addition: add a flat amount if includes_flight is 1
df.loc[df["includes_flight"] == 1, "price"] += 300
# Add some randomness to the price
df["price"] = df["price"] * np.random.uniform(0.9, 1.1, size=num_samples)
# Round to two decimal places (cents)
df["price"] = df["price"].round(2)
# Save to CSV
df.to_csv("travel_cost_data.csv", index=False)
print("Dataset generated and saved to travel_cost_data.csv")
print(df.head(10))
Sample Output:
destination_country | season | trip_duration_days | includes_flight | accommodation_rating | price |
---|---|---|---|---|---|
Thailand | Autumn | 10 | 1 | 3 | 935.17 |
USA | Winter | 5 | 1 | 5 | 849.12 |
France | Autumn | 7 | 1 | 2 | 809.06 |
USA | Summer | 9 | 1 | 3 | 1047.63 |
Thailand | Winter | 9 | 1 | 5 | 1073.31 |
Thailand | Autumn | 13 | 1 | 2 | 1052.44 |
USA | Summer | 7 | 1 | 4 | 1127.49 |
Peru | Winter | 8 | 1 | 1 | 1017.89 |
Peru | Spring | 6 | 1 | 3 | 893.99 |
Peru | Autumn | 12 | 1 | 5 | 1307.09 |
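To sanity-check the pricing logic, it can help to compute the noise-free price of a single trip by hand. The sketch below mirrors the generator's rules in plain Python. The function name `expected_price` and the constant names are hypothetical helpers introduced for illustration, not part of the tutorial's code.

```python
# Noise-free price for one trip, mirroring the synthetic pricing rules above.
# `expected_price` and the constants are hypothetical names used for illustration.
COUNTRY_PRICE = {"France": 200, "Thailand": 100, "USA": 250, "Peru": 150}
SEASON_FACTOR = {"Winter": 0.9, "Spring": 1.0, "Summer": 1.2, "Autumn": 1.0}
FLIGHT_SURCHARGE = 300


def expected_price(country, season, days, rating, includes_flight):
    """Return the price before the +/-10% random noise is applied."""
    base = days * 50 + rating * 40                      # duration + rating
    seasonal = (base + COUNTRY_PRICE[country]) * SEASON_FACTOR[season]
    flight = FLIGHT_SURCHARGE if includes_flight else 0  # flat, not scaled by season
    return round(seasonal + flight, 2)


price = expected_price("Peru", "Summer", days=7, rating=4, includes_flight=True)
# The generator then multiplies by a uniform factor in [0.9, 1.1]
low, high = round(price * 0.9, 2), round(price * 1.1, 2)
print(price, low, high)
```

For a 7-day summer trip to Peru with a flight and a rating-4 stay, the arithmetic is (350 + 160 + 150) × 1.2 + 300 = 1092 before noise. Note that the season factor multiplies part of the price, which is one reason a purely additive linear model can only approximate this data.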
Exploratory Data Analysis (EDA)
Understanding your data is crucial before building any machine learning model. Let’s explore the dataset to uncover patterns and insights.
Loading and Inspecting the Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Load dataset
df = pd.read_csv("travel_cost_data.csv")
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 destination_country 300 non-null object
1 season 300 non-null object
2 trip_duration_days 300 non-null int64
3 includes_flight 300 non-null int64
4 accommodation_rating 300 non-null int64
5 price 300 non-null float64
dtypes: float64(1), int64(3), object(2)
memory usage: 14.1+ KB
Visualizing the Data
Distribution of Trip Prices
# Distribution of trip prices
sns.histplot(df['price'], kde=True)
plt.title("Distribution of Trip Prices")
plt.xlabel("Price (USD)")
plt.ylabel("Frequency")
plt.show()
# Average Price By Season
season_means = df.groupby('season')['price'].mean()
season_means.plot(kind='bar', color='skyblue')
plt.title("Average Trip Price by Season")
plt.ylabel("Average Price (USD)")
plt.xlabel("Season")
plt.xticks(rotation=0)
plt.show()

The distribution of trip prices shows a concentration around the mean with some variability, indicating that while most trips fall within a certain price range, there are outliers with higher costs.

Summer trips tend to be more expensive on average, likely due to increased demand.
Data Preprocessing & Feature Engineering
Before feeding the data into our model, we need to preprocess it to ensure all features are in a suitable format.
Handling Categorical Variables
Our dataset includes categorical features like destination_country and season. We’ll convert these into numerical values using One-Hot Encoding, a data transformation technique that represents each category as its own 0/1 column so that our model can consume it and make a prediction.
# One-hot encode categorical features
categorical_features = ['destination_country', 'season']
df_encoded = pd.get_dummies(df, columns=categorical_features, drop_first=True)
df_encoded.head()
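By setting drop_first=True, we avoid multicollinearity by dropping the first category (in alphabetical order) of each categorical feature: France for destination_country and Autumn for season. The dropped category becomes the baseline that the remaining dummy columns are measured against.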
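To see which column drop_first=True removes, here is a minimal sketch on a tiny made-up frame (the `demo` data is an assumption for illustration, not the tutorial's dataset):

```python
import pandas as pd

# Tiny illustrative frame (values are made up for the demo)
demo = pd.DataFrame({
    "season": ["Winter", "Spring", "Summer", "Autumn"],
    "trip_duration_days": [5, 7, 9, 11],
})

full = pd.get_dummies(demo, columns=["season"])
reduced = pd.get_dummies(demo, columns=["season"], drop_first=True)

# Categories are ordered alphabetically, so the "Autumn" dummy is the one dropped
dropped = sorted(set(full.columns) - set(reduced.columns))
print(list(reduced.columns))
print("Dropped (baseline) column:", dropped)
```

A row with zeros in every remaining season column is therefore an Autumn trip, which is why the other season coefficients are read relative to Autumn.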
Splitting Features and Target
# Extract target and features
y = df_encoded['price']
X = df_encoded.drop('price', axis=1)
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=95
)
Training the Model
With our data preprocessed, it’s time to train a Linear Regression model.
What is Linear Regression?
Linear Regression is a foundational algorithm in machine learning used for predicting a continuous target variable based on one or more predictor variables. It assumes a linear relationship between the input features and the target.
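Written out, the model's prediction is a weighted sum of the feature values plus an intercept. The symbols below are notation introduced here for illustration: $\hat{y}$ is the predicted price, $x_j$ the $j$-th feature value after one-hot encoding, $\beta_0$ the intercept, and $\beta_j$ the coefficients.

```latex
\hat{y} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j,
\qquad
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \bigl( y_i - \hat{y}_i \bigr)^2
```

The second expression is the ordinary least squares criterion: the coefficients are chosen to minimize the sum of squared differences between actual and predicted prices over the $n$ training rows.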
# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
Model Evaluation
Evaluating our model’s performance ensures that our predictions are reliable. We’ll use metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R² Score.
Calculating Metrics:
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R² Score:", r2)
Sample Output:
Mean Squared Error (MSE): 11994.654321
Root Mean Squared Error (RMSE): 109.523
R² Score: 0.75
Interpretation of Metrics
- MSE & RMSE: Lower values indicate better fit. RMSE is in the same units as the target variable (USD), making it more interpretable.
- R² Score: Represents the proportion of variance in the target variable that’s predictable from the features. An R² of 0.75 suggests that 75% of the variability in travel costs is explained by our model.
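To make these definitions concrete, the following sketch computes the same three metrics by hand with NumPy on four made-up price pairs (the `y_true` and `y_hat` values are assumptions for illustration only):

```python
import numpy as np

# Small made-up example: actual vs. predicted prices in USD
y_true = np.array([900.0, 1100.0, 1000.0, 1200.0])
y_hat = np.array([950.0, 1050.0, 1000.0, 1100.0])

residuals = y_true - y_hat
mse_manual = np.mean(residuals ** 2)               # average squared error
rmse_manual = np.sqrt(mse_manual)                  # back in USD
ss_res = np.sum(residuals ** 2)                    # variation the model misses
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(f"MSE={mse_manual:.2f}  RMSE={rmse_manual:.2f}  R2={r2_manual:.3f}")
```

Here the squared errors sum to 15,000 against a total variation of 50,000, so R² is 1 − 15,000 / 50,000 = 0.7.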
Visualizing Actual vs. Predicted Prices

The scatter plot shows a positive correlation between actual and predicted prices, indicating that our model captures the trend effectively.
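One way to produce such a plot is sketched below. This is not necessarily the code behind the original figure: the helper name `plot_actual_vs_predicted`, the output file name, and the stand-in values are all assumptions for illustration. In the tutorial's flow you would pass y_test and y_pred instead.

```python
import os

import numpy as np
import matplotlib.pyplot as plt


def plot_actual_vs_predicted(y_true, y_pred, path="actual_vs_predicted.png"):
    """Scatter actual against predicted values with a y = x reference line."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(y_true, y_pred, alpha=0.6)
    # Points on this diagonal are predicted exactly right
    lo = min(y_true.min(), y_pred.min())
    hi = max(y_true.max(), y_pred.max())
    ax.plot([lo, hi], [lo, hi], linestyle="--", color="red")
    ax.set_xlabel("Actual Price (USD)")
    ax.set_ylabel("Predicted Price (USD)")
    ax.set_title("Actual vs. Predicted Prices")
    fig.savefig(path, bbox_inches="tight")   # write to disk instead of opening a window
    plt.close(fig)
    return path


# Stand-in values for the demo; in the tutorial you would pass y_test and y_pred
saved_to = plot_actual_vs_predicted([900, 1100, 1000, 1200], [950, 1050, 1000, 1100])
print("Saved plot to", os.path.abspath(saved_to))
```

The closer the points sit to the dashed diagonal, the closer the predictions are to the actual prices.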
Interpreting the Model
Understanding how each feature influences the predicted travel cost provides valuable insights.
Coefficients of the Model:
coefficients = pd.DataFrame({
'Feature': X_train.columns,
'Coefficient': model.coef_
}).sort_values(by='Coefficient', ascending=False)
print(coefficients)
Sample Output
Feature | Coefficient |
---|---|
trip_duration_days | 49.876543 |
accommodation_rating | 40.123456 |
destination_country_USA | 250.654321 |
destination_country_Peru | 150.789012 |
destination_country_Thailand | 100.543210 |
season_Summer | 200.321098 |
season_Spring | 0.000000 |
season_Winter | -100.654321 |
Understanding the Coefficients
- trip_duration_days (49.88): Each additional day increases the trip cost by approximately $49.88, holding the other features constant.
- accommodation_rating (40.12): Higher accommodation ratings contribute to higher costs.
- destination_country_USA (250.65): Traveling to the USA adds around $250.65 to the trip cost compared to the baseline category (France, since we used drop_first=True).
- season_Summer (200.32): Summer trips are about $200.32 more expensive than Autumn trips (the baseline season).
Negative coefficients (e.g., season_Winter) indicate a decrease in trip costs compared to the baseline.
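The coefficients can be read this way because a linear model's prediction is just the intercept plus each coefficient times its feature value. A minimal sketch on noise-free toy data (the `X_demo`, `y_demo`, and `demo_model` names and values are assumptions for illustration) shows that a hand calculation matches model.predict:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from price = 100 + 50*days + 40*rating (no noise)
X_demo = np.array([[3, 1], [5, 2], [7, 4], [10, 3], [14, 5]], dtype=float)
y_demo = 100 + 50 * X_demo[:, 0] + 40 * X_demo[:, 1]

demo_model = LinearRegression().fit(X_demo, y_demo)

# A prediction is intercept + sum(coefficient * feature value)
trip = np.array([8.0, 4.0])                        # 8 days, rating 4
by_hand = demo_model.intercept_ + np.dot(demo_model.coef_, trip)
by_model = demo_model.predict(trip.reshape(1, -1))[0]

print("coefficients:", np.round(demo_model.coef_, 2))
print("by hand:", round(float(by_hand), 2), "| model.predict:", round(float(by_model), 2))
```

The same calculation applies to the travel model: model.intercept_ plus the dot product of model.coef_ with an encoded row gives that row's predicted price.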
Making Predictions
Now that our model is trained and evaluated, let’s use it to predict travel costs based on user input.
Handling New Input Data
To make accurate predictions, we need to ensure that the input data undergoes the same preprocessing steps as the training data. This includes one-hot encoding and aligning the feature columns.
# Example: User input
input_data = {
'destination_country': ['Peru'],
'season': ['Summer'],
'trip_duration_days': [7],
'includes_flight': [1],
'accommodation_rating': [4]
}
user_df = pd.DataFrame(input_data)
# One-hot encode WITHOUT drop_first: on a single row, drop_first=True would
# drop the only category present and silently encode the baseline instead
user_df_encoded = pd.get_dummies(user_df, columns=['destination_country', 'season'])
# Align with the training columns: missing dummies become 0, and columns the
# model never saw (the baseline categories) are dropped
user_df_encoded = user_df_encoded.reindex(columns=X_train.columns, fill_value=0)
# Make prediction
predicted_price = model.predict(user_df_encoded)
print(f"Estimated Trip Cost: ${predicted_price[0]:.2f}")
Sample Output:
Estimated Trip Cost: $1105.75
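A subtle point when encoding a single input row: pd.get_dummies only sees the categories present in that row, so drop_first=True removes the one category the user actually chose. The sketch below illustrates this, using a hypothetical `train_columns` list as a stand-in for X_train.columns.

```python
import pandas as pd

# Hypothetical stand-in for X_train.columns after get_dummies(..., drop_first=True)
train_columns = [
    "trip_duration_days", "includes_flight", "accommodation_rating",
    "destination_country_Peru", "destination_country_Thailand",
    "destination_country_USA",
    "season_Spring", "season_Summer", "season_Winter",
]

row = pd.DataFrame({
    "destination_country": ["Peru"], "season": ["Summer"],
    "trip_duration_days": [7], "includes_flight": [1], "accommodation_rating": [4],
})
cats = ["destination_country", "season"]

# drop_first=True on one row drops the only category present
dropped = pd.get_dummies(row, columns=cats, drop_first=True)
# Encoding every category, then aligning to the training columns, keeps it
aligned = pd.get_dummies(row, columns=cats).reindex(columns=train_columns, fill_value=0)

print(list(dropped.columns))
print(aligned.iloc[0].to_dict())
```

With the first approach, the Peru and Summer dummies never appear, so filling missing columns with 0 would price the trip as France in Autumn. The aligned version keeps destination_country_Peru and season_Summer set.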
Wrap-up and Next Steps
Summary
In this tutorial, we:
- Generated a Synthetic Dataset: Simulating travel costs based on various factors.
- Performed Exploratory Data Analysis (EDA): Understanding data distributions and relationships.
- Preprocessed the Data: Handling categorical variables and preparing features.
- Trained a Linear Regression Model: Predicting travel costs.
- Evaluated the Model: Using MSE, RMSE, and R² metrics.
- Interpreted Model Coefficients: Gaining insights into feature impacts.
- Made Predictions: Applying the model to new input data.
Conclusion
This tutorial has given you hands-on experience with the fundamentals of machine learning. By building a foundational regression model, you’ve worked through essential concepts such as data preprocessing, feature engineering, model training, and evaluation. This project serves as a stepping stone, illustrating how theoretical knowledge translates into practical applications.
As you continue your machine learning journey, keep experimenting with different models, explore advanced techniques, and tackle diverse datasets. Each new project will deepen your understanding and enhance your skills, empowering you to develop more sophisticated and robust models. Embrace the iterative nature of learning ML, and soon you’ll be equipped to tackle complex challenges and create impactful solutions across various domains.
If you found this tutorial helpful, feel free to share it with others or reach out with any questions. Stay tuned for more machine learning guides and tools to empower your data journey!
Disclaimer: The dataset used in this tutorial is synthetic and for educational purposes only. For real-world applications, ensure you use accurate and comprehensive data.