Deep Learning for Harvard Spectral Star Classification

Introduction

This project focuses on the classification of stars using data from the Gaia DR3 catalog, filtered to include stellar properties such as the apparent magnitudes in several photometric bands. The goal is to classify stars into their respective spectral types (O, B, A, F, G, K, M) based on these features.

Project Overview

We aim to develop a machine learning model capable of accurately classifying stars using their physical characteristics. The dataset includes stars of all spectral types with various features that will be leveraged to train a model and evaluate its performance on unseen data.

Model Architecture

For this task, we use an Artificial Neural Network (ANN) to handle the classification of stars into spectral types. The ANN is designed to:

  • Extract complex patterns from the numerical data (e.g., temperature, luminosity).
  • Handle imbalanced classes through techniques such as SMOTE to balance the training set.
  • Provide robust performance across all spectral classes.
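As a minimal sketch of this kind of classifier, the following builds a small feed-forward network with scikit-learn's `MLPClassifier` on synthetic stand-in data. The layer sizes, feature count, and scaling step are illustrative assumptions, not the report's actual architecture.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the Gaia features (e.g. temperature, luminosity, magnitudes)
X = rng.normal(size=(700, 5))
y = rng.integers(0, 7, size=700)  # 7 spectral classes: O, B, A, F, G, K, M

# Standardize the numerical features before training
X_scaled = StandardScaler().fit_transform(X)

# Two hidden layers; these sizes are illustrative, not the report's configuration
ann = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=200, random_state=0)
ann.fit(X_scaled, y)
print(ann.predict(X_scaled[:5]).shape)  # (5,)
```

A network like this outputs one of the seven class labels per input row; the real model would be trained on the preprocessed Gaia features instead of random data.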

Below is a basic idea of our classification model. Further details on the architecture of the ANN are given in the Architecture section of the report. We chose an ANN because the input features are all numerical, and a neural network is well suited to finding patterns in such data.

Model Architecture Diagram

Task Breakdown

  1. Data Preprocessing:
    • Handle missing data by dropping rows or columns and imputing missing values where necessary.
    • Normalize numerical features (e.g. the apparent magnitudes) to ensure uniform scale for model training.
    • Encode the target labels (spectral types) as numerical values.
  2. Balancing the Data:
    • Use SMOTE (Synthetic Minority Over-sampling Technique) to handle class imbalance, ensuring that each spectral type is represented equally in the training data.
  3. Model Training:
    • Build and train the Artificial Neural Network (ANN) using the preprocessed data.
    • Evaluate the model’s performance using the test set to ensure it generalizes well to unseen data.
  4. Evaluation and Results:
    • Assess the model’s accuracy, precision, recall, and F1 score to determine its effectiveness in classifying stars into their respective spectral types.
    • Visualize the results using confusion matrices and other metrics.
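The four steps above can be sketched end to end as follows. The toy data, the network sizes, and the use of simple random oversampling (as a stand-in for imblearn's SMOTE, which interpolates new minority samples rather than duplicating them) are all assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Toy stand-in for the Gaia table: features with some missing values,
# and string spectral-type labels (the real features come from Gaia DR3)
X = rng.normal(size=(700, 4))
X[rng.random(X.shape) < 0.05] = np.nan
y = rng.choice(list("OBAFGKM"), size=700)

# 1. Preprocessing: impute missing values, scale features, encode labels
X = SimpleImputer(strategy="median").fit_transform(X)
X = StandardScaler().fit_transform(X)
y_enc = LabelEncoder().fit_transform(y)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y_enc, test_size=0.2, random_state=0, stratify=y_enc)

# 2. Balancing: oversample every class up to the majority count
cls, counts = np.unique(y_tr, return_counts=True)
idx = np.concatenate(
    [rng.choice(np.flatnonzero(y_tr == c), counts.max()) for c in cls])
X_bal, y_bal = X_tr[idx], y_tr[idx]

# 3/4. Train the ANN on the balanced set and evaluate on held-out data
ann = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
ann.fit(X_bal, y_bal)
print(round(f1_score(y_te, ann.predict(X_te), average="macro"), 3))
```

On random labels the score is near chance; the point is the pipeline shape, with each numbered task mapping to one commented step.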

Results

We used a Support Vector Machine (SVM) as the baseline model. Below is its confusion matrix; it achieved an accuracy of around 70%.
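A baseline of this kind can be sketched with scikit-learn's `SVC` on synthetic seven-class data; the kernel and hyperparameters here are assumptions, since the report does not state them.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.datasets import make_classification

# Synthetic 7-class data as a stand-in for the Gaia features
X, y = make_classification(n_samples=700, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=7,
                           n_clusters_per_class=1, random_state=0)

svm = SVC(kernel="rbf")  # illustrative choice; the report's kernel is not given
svm.fit(X, y)
pred = svm.predict(X)

# One row per true class, one column per predicted class
cm = confusion_matrix(y, pred)
print(cm.shape)  # (7, 7)
print(round(accuracy_score(y, pred), 2))
```

The diagonal of the confusion matrix counts correct predictions per spectral type, which is what the figure below visualizes for the real model.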

SVM Confusion Matrix

Our primary model achieved an accuracy of approximately 84%, outperforming the baseline SVM by about 14 percentage points. The model's performance metrics at the 200th epoch were: train error = 0.1644, train loss = 0.4544, validation error = 0.1613, validation loss = 0.4500. Below are the error and loss curves for training and validation.
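Curves like these can be drawn from the per-epoch history with matplotlib; the history values here are illustrative placeholders, not the model's actual logs.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt

# Illustrative per-epoch values; the real curves come from the training logs
epochs = range(1, 6)
history = {
    "train_loss": [0.90, 0.70, 0.60, 0.50, 0.45],
    "val_loss":   [0.95, 0.75, 0.62, 0.52, 0.45],
    "train_err":  [0.40, 0.30, 0.24, 0.19, 0.16],
    "val_err":    [0.42, 0.31, 0.25, 0.20, 0.16],
}

fig, (ax_loss, ax_err) = plt.subplots(1, 2, figsize=(10, 4))
ax_loss.plot(epochs, history["train_loss"], label="train")
ax_loss.plot(epochs, history["val_loss"], label="validation")
ax_loss.set(xlabel="epoch", ylabel="loss", title="Loss")
ax_err.plot(epochs, history["train_err"], label="train")
ax_err.plot(epochs, history["val_err"], label="validation")
ax_err.set(xlabel="epoch", ylabel="error", title="Error")
for ax in (ax_loss, ax_err):
    ax.legend()
fig.savefig("curves.png")
```

Plotting train and validation together makes it easy to spot overfitting: a widening gap between the two curves would signal it, whereas here they track each other closely.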

Training and Validation Curves

We used the predicted spectral classes on the test data to assign a temperature to each star and created our own HR diagram from the results, since an HR diagram is a good way to showcase and validate the model's predictions.
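A diagram of this kind can be sketched as below. The class-to-temperature midpoints and the crude luminosity relation are illustrative assumptions; the convention that matters is the reversed temperature axis, with hot stars on the left.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Rough effective-temperature midpoints (K) per spectral class; illustrative
temp_by_class = {"O": 40000, "B": 20000, "A": 8500, "F": 6750,
                 "G": 5600, "K": 4400, "M": 3000}

# Stand-in for the model's predicted classes on the test set
pred_classes = rng.choice(list(temp_by_class), size=300)
temps = np.array([temp_by_class[c] for c in pred_classes]) * rng.normal(1, 0.05, 300)
# Crude toy luminosity relation with scatter, just to populate the plot
lum = (temps / 5772) ** 7 * rng.lognormal(0, 0.3, 300)

fig, ax = plt.subplots()
ax.scatter(temps, lum, s=8)
ax.set_xscale("log")
ax.set_yscale("log")
ax.invert_xaxis()  # HR convention: temperature decreases left to right
ax.set(xlabel="Effective temperature (K)", ylabel="Luminosity (L_sun)",
       title="HR diagram from predicted spectral classes")
fig.savefig("hr_diagram.png")
```

The real diagram plots one point per test star using the temperature implied by its predicted class, so a main-sequence-like band emerging from the plot is evidence the predictions are sensible.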

Generated HR Diagram

We can clearly see the similarities between our HR diagram and an official one in the image below. There is a band down the middle that resembles the main sequence, where most stars lie, highlighted in blue. There is a grouping of high-luminosity, low-temperature stars highlighted in red, which are the subgiants, and an even higher-luminosity, lower-temperature grouping highlighted in yellow, which are the giants. This shows that our model predicted classes whose distribution closely resembles an actual HR diagram, indicating that its predictions are physically plausible.

Our HR Diagram vs Real HR Diagram