This capstone project applies a full data science pipeline to a real-world business problem: predicting whether the SpaceX Falcon 9 first stage will land successfully after launch.
SpaceX advertises Falcon 9 rocket launches at approximately $62 million, while other providers charge upward of $165 million. The key to this cost advantage is first-stage reusability. By accurately predicting whether the first stage will land, we can estimate the true cost of a launch, which is valuable information for any company looking to compete with or bid against SpaceX.
Raw Data Sources
│
├── SpaceX REST API
└── Wikipedia Web Scraping
│
▼
Data Processing (Pandas)
│
▼
Exploratory Data Analysis
│
├── SQL Analysis
└── Matplotlib / Seaborn Visualizations
│
▼
Interactive Analytics
│
├── Folium Geospatial Map
└── Plotly Dash Dashboard
│
▼
Machine Learning Models
│
├── Logistic Regression
├── SVM
├── Decision Tree
└── KNN
│
▼
Landing Success Prediction

IBM-Applied-Data-Science-Capstone/
│
├── Module 1 - Introduction/
│   ├── 1. Data Collection API Lab/
│   │   └── data-collection-api-lab.ipynb
│   ├── 2. Data Collection with Web Scraping lab/
│   │   └── data-collection-with-web-scraping.ipynb
│   └── 3. Data Wrangling/
│       └── data-wrangling.ipynb
│
├── Module 2 - Exploratory Data Analysis/
│   ├── 1. EDA with SQL/
│   │   └── eda-with-sql.ipynb
│   └── 2. EDA with Visualization Lab/
│       └── eda-with-visualization.ipynb
│
├── Module 3 - Interactive Visual Analytics & Dashboards/
│   ├── 1. Interactive Visual Analytics with Folium lab/
│   │   └── interactive-visual-analytics-with-folium.ipynb
│   └── 2. Interactive Dashboard with Plotly Dash/
│       └── spacex_dash_app.py
│
├── Module 4 - Predictive Analysis (Classification)/
│   └── SpaceX-Machine-Learning-Prediction.ipynb
│
└── Final_Presentation_Avik.pdf

| Category | Tools / Libraries |
|---|---|
| Language | Python 3 |
| Data Collection | requests, BeautifulSoup |
| Data Manipulation | pandas, numpy |
| Data Visualization | matplotlib, seaborn |
| SQL / Database | sqlite3, ipython-sql |
| Interactive Maps | folium, MarkerCluster, MousePosition |
| Dashboard | Plotly Dash |
| Machine Learning | scikit-learn |
| ML Models | Logistic Regression, SVM, Decision Tree, KNN |
| Hyperparameter Tuning | GridSearchCV |
| Notebook Environment | Jupyter Notebook |
Historical Falcon 9 launch data was collected using the SpaceX public API:
- A `GET` request was made to the SpaceX API endpoint to retrieve historical launch records.
- The JSON response was parsed and converted into a pandas DataFrame.
- Key features were selected; data was filtered to launches on or before November 13, 2020.
- Additional API calls using IDs from rockets, payloads, launchpads, and cores enriched the dataset with booster names, payload masses, orbit types, and core landing outcomes.
- Missing `PayloadMass` values were imputed with the column mean.
- The processed dataset was exported as `dataset_part_1.csv`.
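
The parsing, date filtering, and mean imputation steps above can be sketched as follows. This is a minimal illustration, not the notebook's code: the records in `sample_response` are made-up and flattened (the real API response is nested and needs the extra lookups described above), and the field names `flight_number`, `date_utc`, and `payload_mass` and the helper `build_launch_frame` are assumptions for this example.

```python
import pandas as pd

# Hypothetical, flattened records standing in for the API's JSON response.
sample_response = [
    {"flight_number": 1, "date_utc": "2010-06-04T18:45:00.000Z", "payload_mass": None},
    {"flight_number": 2, "date_utc": "2012-05-22T07:44:00.000Z", "payload_mass": 525.0},
    {"flight_number": 3, "date_utc": "2013-03-01T15:10:00.000Z", "payload_mass": 677.0},
    {"flight_number": 4, "date_utc": "2021-01-08T02:15:00.000Z", "payload_mass": 3500.0},
]


def build_launch_frame(records, cutoff="2020-11-13"):
    """Turn JSON records into a DataFrame, filter by date, impute payload mass."""
    df = pd.json_normalize(records)

    # Keep only launches on or before the cutoff date.
    df["Date"] = pd.to_datetime(df["date_utc"]).dt.date
    df = df[df["Date"] <= pd.Timestamp(cutoff).date()].copy()

    df = df.rename(columns={"flight_number": "FlightNumber",
                            "payload_mass": "PayloadMass"})

    # Replace missing payload masses with the mean of the remaining rows.
    df["PayloadMass"] = pd.to_numeric(df["PayloadMass"])
    df["PayloadMass"] = df["PayloadMass"].fillna(df["PayloadMass"].mean())

    return df[["FlightNumber", "Date", "PayloadMass"]].reset_index(drop=True)


launches = build_launch_frame(sample_response)
```

Calling `launches.to_csv("dataset_part_1.csv", index=False)` afterwards would write the export file named above.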
Falcon 9 historical launch records were also scraped from a Wikipedia page using BeautifulSoup:
- An HTTP GET request retrieved the raw HTML.
- The third HTML table containing launch records was identified and parsed.
- Column names, dates, payloads, booster versions, and landing outcomes were extracted row by row.
- The resulting DataFrame was exported as `spacex_web_scraped.csv`.
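
A rough sketch of the table-parsing step, run against an inline HTML snippet instead of the live Wikipedia page. The snippet, its column headers, its placeholder cell values, and the helper `parse_third_table` are all assumptions made for illustration; the real page has a more irregular layout and needs extra cleanup per cell.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML standing in for the downloaded page; only the third table matters.
page_html = """
<table><tr><th>ignored</th></tr></table>
<table><tr><th>also ignored</th></tr></table>
<table>
  <tr><th>Flight No.</th><th>Payload</th><th>Booster landing</th></tr>
  <tr><td>1</td><td>Payload A</td><td>Failure</td></tr>
  <tr><td>2</td><td>Payload B</td><td>Success</td></tr>
</table>
"""


def parse_third_table(html):
    """Extract the third <table> on the page into a DataFrame, row by row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table")[2]
    rows = table.find_all("tr")

    # The first row holds the column names; the remaining rows hold the records.
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = [
        [td.get_text(strip=True) for td in row.find_all("td")]
        for row in rows[1:]
    ]
    return pd.DataFrame(records, columns=headers)


scraped = parse_third_table(page_html)
```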
- The dataset was inspected for missing values and data type consistency.
- Mission outcomes were converted into a binary training label (`Class`): `1` for successful landing, `0` for unsuccessful.
- "Bad outcomes" such as `False ASDS`, `None None`, `False Ocean`, `False RTLS`, and `None ASDS` were identified and labeled as failures.
- The overall landing success rate was calculated.
- The wrangled dataset was exported as `dataset_part_2.csv`.
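
The labeling rule above can be expressed in a few lines of pandas. This is a sketch on toy rows; the column name `Outcome` is an assumption, while the set of bad outcomes is the one listed above.

```python
import pandas as pd

# Outcomes treated as failed landings, as listed in the wrangling steps.
bad_outcomes = {"False ASDS", "None None", "False Ocean", "False RTLS", "None ASDS"}

# Toy rows for illustration; "Outcome" is an assumed column name.
df = pd.DataFrame(
    {"Outcome": ["True ASDS", "None None", "True RTLS", "False Ocean", "True Ocean"]}
)

# Class is 1 unless the outcome is in the bad set.
df["Class"] = (~df["Outcome"].isin(bad_outcomes)).astype(int)

# The mean of a 0/1 label is the overall success rate.
success_rate = df["Class"].mean()
```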
Using pandas, matplotlib, and seaborn, the following analyses were performed:
| Visualization | Key Insight |
|---|---|
| Flight Number vs. Launch Site | Higher flight numbers (later launches) trend toward more successes; experience improves reliability. |
| Payload Mass vs. Launch Site | VAFB SLC-4E did not handle payloads above ~10,000 kg; heavier payloads don't always mean failure. |
| Success Rate by Orbit Type | ES-L1, GEO, HEO, and SSO show 100% success; GTO shows the lowest rate (~51.9%). |
| Flight Number vs. Orbit Type | LEO success improves with flight number; GTO shows no such relationship. |
| Payload Mass vs. Orbit Type | Polar, LEO, and ISS orbits handle heavy payloads with good success; GTO is less predictable. |
| Launch Success Yearly Trend | Success rate grew from ~0% (2010–2013) to ~84% by 2020, with a notable leap during SpaceX's reusability breakthroughs (2015–2017). |
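
Because `Class` is a 0/1 label, per-group success rates like those in the table reduce to a grouped mean. The sketch below uses made-up rows and assumed column names (`Orbit`, `Class`), so its numbers do not reproduce the figures above.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "Orbit": ["GTO", "GTO", "LEO", "LEO", "SSO"],
    "Class": [1, 0, 1, 1, 1],
})

# Mean of the binary label per orbit is that orbit's success rate.
rate_by_orbit = df.groupby("Orbit")["Class"].mean().sort_values(ascending=False)
```

Passing `rate_by_orbit` to a bar plot (for example `rate_by_orbit.plot(kind="bar")`) would give a chart in the style of the "Success Rate by Orbit Type" row.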
The SpaceX dataset was loaded into an SQLite database and queried using SQL magic (%sql). Key queries included:
- Task 1: Retrieved all 4 unique launch sites: `CCAFS LC-40`, `VAFB SLC-4E`, `KSC LC-39A`, `CCAFS SLC-40`.
- Task 2: Fetched 5 records for launch sites beginning with `'CCA'`.
- Task 3: Total payload mass for NASA (CRS) customers: 48,213 kg.
- Task 4: Average payload mass for the `F9 v1.1` booster: ~2,534.67 kg.
- Task 5: First successful ground pad landing date: 2015-12-22.
- Task 6: Listed booster versions with successful drone ship landings and payload mass between 4,000 and 6,000 kg (`F9 FT B1022`, `F9 FT B1026`, `F9 FT B1021.2`, `F9 FT B1031.2`).
- Task 7: Counted mission outcomes: 98 successes, 1 failure (in flight).
- Task 8: Identified booster versions carrying maximum payload mass (multiple `F9 B5` variants).
- Task 9: Filtered drone ship failure outcomes in 2015: months `01` and `04`, both at `CCAFS LC-40`.
- Task 10: Ranked landing outcome counts between 2010-06-04 and 2017-03-20 in descending order; "No attempt" (10) topped the list.
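
Queries of the kind used in Tasks 1, 2, and 4 can be sketched with the standard-library `sqlite3` module. The table name `SPACEXTBL`, the column names, and the inserted rows are assumptions for this example, so the results differ from the figures reported above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and toy rows for illustration.
conn.execute(
    "CREATE TABLE SPACEXTBL ("
    "Launch_Site TEXT, Booster_Version TEXT, PAYLOAD_MASS__KG_ REAL)"
)
conn.executemany(
    "INSERT INTO SPACEXTBL VALUES (?, ?, ?)",
    [
        ("CCAFS LC-40", "F9 v1.1", 2000.0),
        ("CCAFS LC-40", "F9 v1.1", 3000.0),
        ("VAFB SLC-4E", "F9 FT", 9600.0),
    ],
)

# Task 1 style: unique launch sites.
sites = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT Launch_Site FROM SPACEXTBL ORDER BY Launch_Site"
    )
]

# Task 2 style: up to 5 records whose launch site begins with 'CCA'.
cca_rows = conn.execute(
    "SELECT * FROM SPACEXTBL WHERE Launch_Site LIKE 'CCA%' LIMIT 5"
).fetchall()

# Task 4 style: average payload mass for one booster version.
avg_mass = conn.execute(
    "SELECT AVG(PAYLOAD_MASS__KG_) FROM SPACEXTBL WHERE Booster_Version = ?",
    ("F9 v1.1",),
).fetchone()[0]
```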
Three interactive Folium maps were built to analyze launch site geography:
Map 1: Launch Sites Overview
All SpaceX launch sites were plotted on an interactive map centered at NASA Johnson Space Center. Each site was marked with a `folium.Circle` (1,000 m radius) and a labeled `folium.Marker`.
Map 2: Launch Success/Failure Visualization
Each individual launch record was represented by a color-coded `folium.Marker`:
- 🟢 Green: Successful landing (`Class = 1`)
- 🔴 Red: Failed landing (`Class = 0`)
A `MarkerCluster` was used to manage overlapping markers at the same coordinates, allowing intuitive site-level success rate assessment.
Map 3: Proximity & Distance Analysis
Distances from each launch site to the nearest coastline, railway, highway, and city were calculated using a custom calculate_distance function. folium.PolyLine objects were drawn between the launch site and each proximity point, with distance labels (in km) displayed on the map.
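
The README does not show how `calculate_distance` is implemented. One common choice for this kind of map work is the haversine great-circle formula, sketched below as an assumption; the Earth radius constant is also an assumed value.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def calculate_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1

    # Haversine formula.
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# One degree of latitude is roughly 111 km.
one_degree_km = calculate_distance(0.0, 0.0, 1.0, 0.0)
```

The returned value can be formatted (for example `f"{distance:.2f} km"`) for the distance label placed next to each line on the map.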
A fully interactive web dashboard was built (spacex_dash_app.py) featuring:
| Component | Description |
|---|---|
| Launch Site Dropdown | Filters the dashboard by "All Sites" or a specific launch site |
| Pie Chart | Displays total launch success counts by site (or success/failure ratio for a selected site); KSC LC-39A had the highest success ratio at 76.9% |
| Payload Range Slider | Filters scatter plot data by payload mass (0–10,000 kg) |
| Scatter Plot | Plots Payload Mass vs. Launch Outcome, color-coded by Booster Version Category, to show how payload mass relates to launch outcome |
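
The filtering that the dropdown and slider drive can be sketched as two plain pandas functions that a Dash callback could call. This is not the code in `spacex_dash_app.py`: the function names, the `"ALL"` sentinel, the column names, and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy data with assumed column names.
launch_df = pd.DataFrame({
    "Launch Site": ["KSC LC-39A", "KSC LC-39A", "CCAFS LC-40", "VAFB SLC-4E"],
    "Payload Mass (kg)": [3000.0, 9000.0, 500.0, 9600.0],
    "class": [1, 0, 1, 1],
})


def filter_launches(df, site, payload_range):
    """Rows for the scatter plot: payload within the slider range, optional site."""
    low, high = payload_range
    mask = df["Payload Mass (kg)"].between(low, high)
    if site != "ALL":
        mask &= df["Launch Site"] == site
    return df[mask]


def pie_values(df, site):
    """Successes per site for all sites, or outcome counts for one site."""
    if site == "ALL":
        return df.groupby("Launch Site")["class"].sum()
    return df.loc[df["Launch Site"] == site, "class"].value_counts()


light_ksc = filter_launches(launch_df, "KSC LC-39A", (0, 5000))
all_site_successes = pie_values(launch_df, "ALL")
```

Keeping the filtering in plain functions like these makes it possible to check the logic without starting the Dash server.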
A machine learning pipeline was built to classify whether the Falcon 9 first stage would land successfully.
- Target variable `Y` extracted from the `Class` column.
- Features `X` standardized using `StandardScaler()`.
- Data split: 80% training / 20% test.
| Model | Best Hyperparameters | Validation Accuracy | Test Accuracy |
|---|---|---|---|
| Logistic Regression | `C=0.01`, `penalty='l2'`, `solver='lbfgs'` | 0.8464 | 0.8333 |
| Support Vector Machine (SVM) | `C=1.0`, `gamma≈0.032`, `kernel='sigmoid'` | 0.8482 | 0.8333 |
| Decision Tree Classifier | `criterion='entropy'`, `max_depth=10`, `max_features='sqrt'` | 0.8625 | 0.8333 |
| K-Nearest Neighbors (KNN) | `algorithm='auto'`, `n_neighbors=10`, `p=1` | 0.8482 | 0.8333 |
All models used GridSearchCV with 10-fold cross-validation for hyperparameter tuning.
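
The preprocessing and tuning setup described above can be sketched for one of the four models. The data here is synthetic (`make_classification`), the grid covers only `C`, and the `random_state` values are arbitrary, so the scores will not match the table; treat it as an outline of the workflow, not the notebook's exact code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the launch features X and the Class label Y.
X, Y = make_classification(n_samples=90, n_features=8, random_state=2)

# Standardize the features, then hold out 20% of the rows for testing.
X = StandardScaler().fit_transform(X)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=2
)

# 10-fold cross-validated grid search over the regularization strength.
param_grid = {"C": [0.01, 0.1, 1]}
search = GridSearchCV(LogisticRegression(solver="lbfgs"), param_grid, cv=10)
search.fit(X_train, Y_train)

validation_accuracy = search.best_score_      # mean CV accuracy of the best setting
test_accuracy = search.score(X_test, Y_test)  # accuracy on the held-out 20%
```

The same pattern applies to the other three models by swapping the estimator and its parameter grid.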
The Decision Tree Classifier achieved the highest validation accuracy of 0.8625. All four models converged to the same test set accuracy of 0.8333.
The confusion matrix for the best model revealed:
- True Positives (correctly predicted: landed): 12
- True Negatives (correctly predicted: did not land): 3
- False Positives: 3
- False Negatives: 0
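
These four counts are consistent with the reported test accuracy: 15 of the 18 test launches were classified correctly. A short check of the arithmetic, with precision and recall derived from the same counts:

```python
# Confusion-matrix counts from the list above.
tp, tn, fp, fn = 12, 3, 3, 0

total = tp + tn + fp + fn          # 18 test samples
accuracy = (tp + tn) / total       # 15 / 18, about 0.8333
precision = tp / (tp + fp)         # 12 / 15 = 0.8
recall = tp / (tp + fn)            # 12 / 12 = 1.0
```

The model's only errors are the three false positives, launches predicted to land that did not.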
- Launch success has improved significantly over time. The overall success rate grew from nearly 0% in 2010–2013 to approximately 84% by 2020, with a major leap coinciding with SpaceX's reusability breakthroughs between 2015 and 2017.
- Higher flight numbers correlate with higher success rates. Later launches, especially at CCAFS SLC-40, show a clear improvement in landing outcomes.
- KSC LC-39A is the highest-performing launch site, with a success ratio of ~76.9% among the four sites.
- Orbit type strongly influences landing success. ES-L1, GEO, HEO, and SSO achieved 100% success rates in the dataset, while GTO had the lowest at ~51.9%.
- VAFB SLC-4E has no recorded launches with payloads above ~10,000 kg.
- The Decision Tree Classifier was the best-performing model by validation accuracy (0.8625). Its test accuracy of 0.8333 was the same as that of the other three models, so the test set does not separate them.
- All SpaceX launch sites share key geographical traits, namely proximity to coastlines and equatorial latitudes, which helps optimize trajectory efficiency and booster recovery.
| Lab | Link |
|---|---|
| Data Collection: SpaceX API | Notebook |
| Data Collection: Web Scraping | Notebook |
| Data Wrangling | Notebook |
| EDA with SQL | Notebook |
| EDA with Visualization | Notebook |
| Interactive Map with Folium | Notebook |
| Plotly Dash Dashboard | Script |
| Machine Learning Prediction | Notebook |
The full project presentation (Final_Presentation_Avik.pdf) is included in this repository and covers the executive summary, methodology, EDA insights, interactive analytics demos, and predictive analysis results end-to-end.
This project was completed as part of the IBM Data Science Professional Certificate on Coursera.