Avik-Das-567/IBM-Applied-Data-Science-Capstone

SpaceX Falcon 9 Landing Prediction | IBM Applied Data Science Capstone

📌 Project Overview

This capstone project applies a full data science pipeline to a real-world business problem: predicting whether the SpaceX Falcon 9 first stage will land successfully after launch.

SpaceX advertises Falcon 9 rocket launches at approximately $62 million, while other providers charge upward of $165 million. The key to this cost advantage is first-stage reusability. Predicting whether the first stage will land therefore gives an estimate of the true cost of a launch, which is useful to any company looking to compete with or bid against SpaceX.


🏗️ Project Architecture

Raw Data Sources
│
├── SpaceX REST API
├── Wikipedia Web Scraping
│
↓
Data Processing (Pandas)
│
↓
Exploratory Data Analysis
│
├── SQL Analysis
├── Matplotlib / Seaborn Visualizations
│
↓
Interactive Analytics
│
├── Folium Geospatial Map
├── Plotly Dash Dashboard
│
↓
Machine Learning Models
│
├── Logistic Regression
├── SVM
├── Decision Tree
└── KNN
│
↓
Landing Success Prediction

🗂️ Repository Structure

IBM-Applied-Data-Science-Capstone/
│
├── Module 1 - Introduction/
│   ├── 1. Data Collection API Lab/
│   │   └── data-collection-api-lab.ipynb
│   ├── 2. Data Collection with Web Scraping lab/
│   │   └── data-collection-with-web-scraping.ipynb
│   └── 3. Data Wrangling/
│       └── data-wrangling.ipynb
│
├── Module 2 - Exploratory Data Analysis/
│   ├── 1. EDA with SQL/
│   │   └── eda-with-sql.ipynb
│   └── 2. EDA with Visualization Lab/
│       └── eda-with-visualization.ipynb
│
├── Module 3 - Interactive Visual Analytics & Dashboards/
│   ├── 1. Interactive Visual Analytics with Folium lab/
│   │   └── interactive-visual-analytics-with-folium.ipynb
│   └── 2. Interactive Dashboard with Plotly Dash/
│       └── spacex_dash_app.py
│
├── Module 4 - Predictive Analysis (Classification)/
│   └── SpaceX-Machine-Learning-Prediction.ipynb
│
└── Final_Presentation_Avik.pdf

🧰 Technologies & Libraries Used

| Category | Tools / Libraries |
|---|---|
| Language | Python 3 |
| Data Collection | requests, BeautifulSoup |
| Data Manipulation | pandas, numpy |
| Data Visualization | matplotlib, seaborn |
| SQL / Database | sqlite3, ipython-sql |
| Interactive Maps | folium, MarkerCluster, MousePosition |
| Dashboard | Plotly Dash |
| Machine Learning | scikit-learn |
| ML Models | Logistic Regression, SVM, Decision Tree, KNN |
| Hyperparameter Tuning | GridSearchCV |
| Notebook Environment | Jupyter Notebook |

📦 Modules & Methodology

Module 1 - Data Collection & Wrangling

🔹 Data Collection via SpaceX REST API

Historical Falcon 9 launch data was collected using the SpaceX public API:

  • A GET request was made to the SpaceX API endpoint to retrieve historical launch records.
  • The JSON response was parsed and converted into a Pandas DataFrame.
  • Key features were selected; data was filtered to launches on or before November 13, 2020.
  • Additional API calls using IDs from rockets, payloads, launchpads, and cores enriched the dataset with booster names, payload masses, orbit types, and core landing outcomes.
  • Missing PayloadMass values were imputed with the column mean.
  • The processed dataset was exported as dataset_part_1.csv.
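
The date filter and mean imputation described above can be sketched as follows. The notebook works on a pandas DataFrame; this version uses only the standard library so it runs without dependencies, and the records, the `filter_and_impute` helper, and the field names are illustrative assumptions rather than the notebook's actual code.

```python
from datetime import date
from statistics import mean

# Toy records standing in for rows built from the API response (illustrative values).
launches = [
    {"FlightNumber": 1, "Date": date(2010, 6, 4), "PayloadMass": None},
    {"FlightNumber": 2, "Date": date(2012, 5, 22), "PayloadMass": 525.0},
    {"FlightNumber": 3, "Date": date(2013, 3, 1), "PayloadMass": 677.0},
    {"FlightNumber": 4, "Date": date(2021, 1, 8), "PayloadMass": 15600.0},
]

CUTOFF = date(2020, 11, 13)  # launches after this date are dropped


def filter_and_impute(rows, cutoff=CUTOFF):
    """Keep launches on or before the cutoff and fill missing payload masses with the mean."""
    kept = [dict(r) for r in rows if r["Date"] <= cutoff]  # copy so the input is not mutated
    known = [r["PayloadMass"] for r in kept if r["PayloadMass"] is not None]
    fill_value = mean(known)  # mean of the observed values only
    for r in kept:
        if r["PayloadMass"] is None:
            r["PayloadMass"] = fill_value
    return kept


cleaned = filter_and_impute(launches)
```

Computing the mean after filtering keeps out-of-range launches from influencing the fill value; whether the notebook applies the two steps in this order is not stated in this README.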

🔹 Data Collection via Web Scraping

Falcon 9 historical launch records were also scraped from a Wikipedia page using BeautifulSoup:

  • An HTTP GET request retrieved the raw HTML.
  • The third HTML table containing launch records was identified and parsed.
  • Column names, dates, payloads, booster versions, and landing outcomes were extracted row by row.
  • The resulting DataFrame was exported as spacex_web_scraped.csv.
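
As a rough illustration of picking the third table on a page and reading it row by row, the sketch below parses a small inline HTML string. It deliberately swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages; the `TableCollector` class, the sample HTML, and the column names are all hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table, in page order
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


# Made-up page: two unrelated tables followed by the launch table.
html = """
<table><tr><td>nav</td></tr></table>
<table><tr><td>infobox</td></tr></table>
<table>
  <tr><th>Flight No.</th><th>Payload</th><th>Booster landing</th></tr>
  <tr><td>1</td><td>Example payload</td><td>Failure</td></tr>
</table>
"""

parser = TableCollector()
parser.feed(html)
header, *rows = parser.tables[2]  # third table on the page
records = [dict(zip(header, row)) for row in rows]
```

Each entry in `records` maps a column name to the cell text, which is the shape a DataFrame constructor would accept before export to CSV.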

🔹 Data Wrangling

  • The dataset was inspected for missing values and data type consistency.
  • Mission outcomes were converted into a binary training label (Class): 1 for successful landing, 0 for unsuccessful.
  • "Bad outcomes" such as False ASDS, None None, False Ocean, False RTLS, and None ASDS were identified and labeled as failures.
  • The overall landing success rate was calculated.
  • The wrangled dataset was exported as dataset_part_2.csv.
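
The labeling rule above amounts to a set-membership test. The following is a minimal sketch: the set of bad outcomes is copied from the list above, while the `landing_class` helper and the sample outcome values are illustrative assumptions, not rows from the real dataset.

```python
# Outcome strings treated as failures (taken from the list above).
BAD_OUTCOMES = {"False ASDS", "None None", "False Ocean", "False RTLS", "None ASDS"}


def landing_class(outcome):
    """Return 0 for a bad outcome and 1 otherwise."""
    return 0 if outcome in BAD_OUTCOMES else 1


# Illustrative outcome values only.
outcomes = ["None None", "True ASDS", "False Ocean", "True RTLS", "True Ocean"]
classes = [landing_class(o) for o in outcomes]

# With a 0/1 label, the success rate is simply the mean of the Class column.
success_rate = sum(classes) / len(classes)
```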

Module 2 - Exploratory Data Analysis (EDA)

🔹 EDA with Data Visualization

Using pandas, matplotlib, and seaborn, the following analyses were performed:

| Visualization | Key Insight |
|---|---|
| Flight Number vs. Launch Site | Higher flight numbers (later launches) trend toward more successes; experience improves reliability. |
| Payload Mass vs. Launch Site | VAFB SLC-4E did not handle payloads above ~10,000 kg; heavier payloads don't always mean failure. |
| Success Rate by Orbit Type | ES-L1, GEO, HEO, and SSO show 100% success; GTO shows the lowest rate (~51.9%). |
| Flight Number vs. Orbit Type | LEO success improves with flight number; GTO shows no such relationship. |
| Payload Mass vs. Orbit Type | Polar, LEO, and ISS orbits handle heavy payloads with good success; GTO is less predictable. |
| Launch Success Yearly Trend | Success rate grew from ~0% (2010–2013) to ~84% by 2020, with a notable leap during SpaceX's reusability breakthroughs (2015–2017). |

🔹 EDA with SQL

The SpaceX dataset was loaded into an SQLite database and queried using SQL magic (%sql). Key queries included:

  • Task 1: Retrieved all 4 unique launch sites: CCAFS LC-40, VAFB SLC-4E, KSC LC-39A, CCAFS SLC-40.
  • Task 2: Fetched 5 records for launch sites beginning with 'CCA'.
  • Task 3: Total payload mass for NASA (CRS) customers: 48,213 kg.
  • Task 4: Average payload mass for F9 v1.1 booster: ~2,534.67 kg.
  • Task 5: First successful ground pad landing date: 2015-12-22.
  • Task 6: Listed booster versions with successful drone ship landings and payload mass between 4,000–6,000 kg (F9 FT B1022, F9 FT B1026, F9 FT B1021.2, F9 FT B1031.2).
  • Task 7: Counted mission outcomes: 98 successes, 1 failure (in flight).
  • Task 8: Identified booster versions carrying maximum payload mass (multiple F9 B5 variants).
  • Task 9: Filtered drone ship failure outcomes in 2015: months 01 and 04, both at CCAFS LC-40.
  • Task 10: Ranked landing outcome counts between 2010-06-04 and 2017-03-20 in descending order; "No attempt" (10) topped the list.
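
For readers who want to see the query patterns behind Tasks 1 to 3, here is a small sketch using Python's built-in `sqlite3` module in place of the `%sql` magic. The table name `SPACEXTBL`, the column names, and the four rows are assumptions made for illustration; they are toy data and will not reproduce the figures reported above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SPACEXTBL (Launch_Site TEXT, Customer TEXT, PAYLOAD_MASS__KG_ INTEGER)"
)
# Toy rows, not the real launch records.
conn.executemany(
    "INSERT INTO SPACEXTBL VALUES (?, ?, ?)",
    [
        ("CCAFS LC-40", "NASA (CRS)", 500),
        ("CCAFS LC-40", "SES", 3000),
        ("VAFB SLC-4E", "MDA", 500),
        ("KSC LC-39A", "NASA (CRS)", 2500),
    ],
)

# Task 1 pattern: unique launch sites.
sites = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT Launch_Site FROM SPACEXTBL ORDER BY Launch_Site"
    )
]

# Task 2 pattern: up to 5 records whose site begins with 'CCA'.
cca_rows = conn.execute(
    "SELECT * FROM SPACEXTBL WHERE Launch_Site LIKE 'CCA%' LIMIT 5"
).fetchall()

# Task 3 pattern: total payload mass for one customer.
nasa_total = conn.execute(
    "SELECT SUM(PAYLOAD_MASS__KG_) FROM SPACEXTBL WHERE Customer = ?",
    ("NASA (CRS)",),
).fetchone()[0]

conn.close()
```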

Module 3 - Interactive Visual Analytics & Dashboard

🔹 Interactive Map with Folium

Three interactive Folium maps were built to analyze launch site geography:

Map 1: Launch Sites Overview. All SpaceX launch sites were plotted on an interactive map centered at NASA Johnson Space Center. Each site was marked with a folium.Circle (1,000 m radius) and a labeled folium.Marker.

Map 2: Launch Success/Failure Visualization. Each individual launch record was represented by a color-coded folium.Marker:

  • 🟢 Green → Successful landing (Class = 1)
  • 🔴 Red → Failed landing (Class = 0)

A MarkerCluster was used to manage overlapping markers at the same coordinates, allowing intuitive site-level success rate assessment.

Map 3: Proximity & Distance Analysis. Distances from each launch site to the nearest coastline, railway, highway, and city were calculated using a custom calculate_distance function. folium.PolyLine objects were drawn between the launch site and each proximity point, with distance labels (in km) displayed on the map.
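
This README does not show the body of `calculate_distance`. One common way to compute great-circle distance between two latitude/longitude points is the haversine formula, sketched below; the formula choice and the Earth radius constant are assumptions, so results may differ slightly from the notebook's labels.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def calculate_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    # Haversine term: squared half-chord length between the points.
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# One degree of latitude along a meridian is roughly 111 km.
one_degree_km = calculate_distance(0.0, 0.0, 1.0, 0.0)
```

The returned value can be formatted to two decimals and attached as a label at the proximity point, with a `folium.PolyLine` drawn between the two coordinate pairs.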

🔹 Interactive Dashboard with Plotly Dash

A fully interactive web dashboard was built (spacex_dash_app.py) featuring:

| Component | Description |
|---|---|
| Launch Site Dropdown | Filters the dashboard by "All Sites" or a specific launch site |
| Pie Chart | Displays total launch success counts by site (or success/failure ratio for a selected site); KSC LC-39A had the highest success ratio at 76.9% |
| Payload Range Slider | Filters scatter plot data by payload mass (0–10,000 kg) |
| Scatter Plot | Plots Payload Mass vs. Launch Outcome, color-coded by Booster Version Category; reveals correlations between payload mass and mission success |
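
The dropdown and slider both narrow the rows that feed the charts. The sketch below shows only that filtering logic in plain Python, as something a Dash callback could wrap; the function names, the `"ALL"` sentinel, the column keys, and the sample rows are hypothetical and are not taken from spacex_dash_app.py.

```python
def filter_launches(rows, site="ALL", payload_range=(0, 10000)):
    """Return rows matching the selected site and the payload slider range (inclusive)."""
    low, high = payload_range
    return [
        r
        for r in rows
        if (site == "ALL" or r["Launch Site"] == site)
        and low <= r["Payload Mass (kg)"] <= high
    ]


def success_counts(rows):
    """Return (successes, failures), the two slices of a single-site pie chart."""
    successes = sum(r["class"] for r in rows)
    return successes, len(rows) - successes


# Illustrative rows only.
rows = [
    {"Launch Site": "KSC LC-39A", "Payload Mass (kg)": 3000, "class": 1},
    {"Launch Site": "KSC LC-39A", "Payload Mass (kg)": 9600, "class": 0},
    {"Launch Site": "CCAFS LC-40", "Payload Mass (kg)": 500, "class": 0},
]

light_ksc = filter_launches(rows, "KSC LC-39A", (0, 5000))
ksc_pie = success_counts(filter_launches(rows, "KSC LC-39A"))
```

Keeping the filter separate from the plotting code makes it easy to test without starting the Dash server.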

Module 4 - Predictive Analysis (Classification)

A machine learning pipeline was built to classify whether the Falcon 9 first stage would land successfully.

🔹 Data Preparation

  • Target variable Y extracted from the Class column.
  • Features X standardized using StandardScaler().
  • Data split: 80% training / 20% test.

🔹 Models Evaluated

| Model | Best Hyperparameters | Validation Accuracy | Test Accuracy |
|---|---|---|---|
| Logistic Regression | C=0.01, penalty='l2', solver='lbfgs' | 0.8464 | 0.8333 |
| Support Vector Machine (SVM) | C=1.0, gamma≈0.032, kernel='sigmoid' | 0.8482 | 0.8333 |
| Decision Tree Classifier | criterion='entropy', max_depth=10, max_features='sqrt' | 0.8625 | 0.8333 |
| K-Nearest Neighbors (KNN) | algorithm='auto', n_neighbors=10, p=1 | 0.8482 | 0.8333 |

All models used GridSearchCV with 10-fold cross-validation for hyperparameter tuning.
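
A minimal sketch of this tuning workflow for one of the four models, assuming scikit-learn is installed. The data is synthetic (`make_classification`), the parameter grid is a reduced example rather than the grid used in the notebook, and the `random_state` values are arbitrary, so the scores it produces are unrelated to the table above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the launch features and the Class label.
X, y = make_classification(n_samples=90, n_features=8, random_state=0)

# Standardize, then hold out 20% of the rows for testing.
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

# Example grid; each combination is scored with 10-fold cross-validation.
param_grid = {"C": [0.01, 0.1, 1.0], "solver": ["lbfgs"]}
search = GridSearchCV(LogisticRegression(), param_grid, cv=10)
search.fit(X_train, y_train)

validation_accuracy = search.best_score_      # mean CV accuracy of the best setting
test_accuracy = search.score(X_test, y_test)  # accuracy on the held-out rows
```

The sketch scales before splitting because that is the order listed under Data Preparation. Fitting the scaler on the training rows only is a common alternative that keeps test-set statistics out of training.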

🔹 Best Model: Decision Tree Classifier

The Decision Tree Classifier achieved the highest validation accuracy of 0.8625. All four models converged to the same test set accuracy of 0.8333.

The confusion matrix for the best model revealed:

  • True Positives (correctly predicted: landed): 12
  • True Negatives (correctly predicted: did not land): 3
  • False Positives: 3
  • False Negatives: 0
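
These four counts are enough to reproduce the reported test accuracy and to derive precision and recall, as the short calculation below shows. Only the counts come from this README; the variable names are illustrative.

```python
# Confusion-matrix counts reported above.
tp, tn, fp, fn = 12, 3, 3, 0

total = tp + tn + fp + fn      # 18 test samples
accuracy = (tp + tn) / total   # 15 / 18
precision = tp / (tp + fp)     # 12 / 15
recall = tp / (tp + fn)        # 12 / 12
```

The accuracy works out to 15/18 ≈ 0.8333, matching the test accuracy in the table. With no false negatives, recall is 1.0, while the three false positives (landings predicted that did not happen) bring precision down to 0.8.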

📊 Key Findings & Conclusions

  1. Launch success has improved significantly over time. The overall success rate grew from nearly 0% in 2010–2013 to approximately 84% by 2020, with a major leap coinciding with SpaceX's reusability breakthroughs between 2015 and 2017.

  2. Higher flight numbers correlate with higher success rates. Later launches, especially at CCAFS SLC-40, show a clear improvement in landing outcomes.

  3. KSC LC-39A is the highest-performing launch site, with a success ratio of ~76.9% among the four sites.

  4. Orbit type strongly influences landing success. ES-L1, GEO, HEO, and SSO achieved 100% success rates in the dataset, while GTO had the lowest at ~51.9%.

  5. VAFB SLC-4E does not support heavy payloads, with no recorded launches exceeding ~10,000 kg.

  6. The Decision Tree Classifier was the best-performing model by validation accuracy (0.8625). Its test accuracy of 0.8333 was identical to that of the other three models, so the test set alone does not separate them.

  7. All SpaceX launch sites share key geographical traits: they sit close to the coast and at comparatively low latitudes, which helps optimize trajectory efficiency and booster recovery.


🔗 Notebook Links

| Lab | Link |
|---|---|
| Data Collection: SpaceX API | Notebook |
| Data Collection: Web Scraping | Notebook |
| Data Wrangling | Notebook |
| EDA with SQL | Notebook |
| EDA with Visualization | Notebook |
| Interactive Map with Folium | Notebook |
| Plotly Dash Dashboard | Script |
| Machine Learning Prediction | Notebook |

📄 Final Presentation

The full project presentation (Final_Presentation_Avik.pdf) is included in this repository and covers the executive summary, methodology, EDA insights, interactive analytics demos, and predictive analysis results end-to-end.


This project was completed as part of the IBM Data Science Professional Certificate on Coursera.
