Identify Galaxies Using the Deep Learning Reference Stack
This article describes how users can detect and classify galaxies by their morphology using image processing and computer vision algorithms. We used data from the Sloan Digital Sky Survey and galaxy classification from the Galaxy Zoo project, along with the Deep Learning Reference Stack, a stack designed to be highly optimized and performant with Intel® Xeon® processors.
Data in science is always evolving and growing, and in astronomy it does so at an impressive pace. With improved technology, new detectors, and better optics and instruments, terabytes of information are available to study the universe.
The data available helps researchers ask important questions, solve problems, give support to hypotheses, and even prove theories. However, researchers must first process the data, make calculations, filter, sort, convert, and perform other operations which require computer processing. This process can become challenging and time consuming, making it important to have tools which make this task easier and faster to accomplish.
In astronomy, morphology is the study of the structural properties of galaxies. Galaxies exhibit certain characteristics or peculiarities; features such as disk shapes, bulges, or bars give context to classifications. There are many classification schemes in use, one example being the Hubble classification. The morphology of a galaxy tells us about its evolution over time. Given the huge number of celestial objects to classify, methods are needed that make this task faster and automated. Many efforts have pursued this goal, including community projects like Galaxy Zoo (Lintott et al. 2008), in which volunteers were asked to classify galaxy images gathered from the Sloan Digital Sky Survey (SDSS) into the following categories: elliptical, clockwise spiral, anticlockwise spiral, edge-on, star/don’t know, or merger.
The open source community offers solutions to many problems, but these can be hard to configure and install on a system when, for example, their code or documentation is incomplete. This article presents an example solution for a particular use case, allowing users to focus on specific tasks such as proper model training and tuning, and to get results faster using the Deep Learning Reference Stack. The Deep Learning Reference Stack has the following key features:
- Deep Learning (DL) framework installed and optimized for Intel platforms.
- Supports both PyTorch* and TensorFlow* frameworks.
- Containerized solution that is host OS agnostic.
- Compatible with cloud platforms.
The initiative is focused on delivering an end-to-end user experience for customers across multiple segments from cloud to edge by fully utilizing and exposing key value-added/high-demand Intel platform features from hardware up through the application. Our goal for this use case is to give the scientific community an example of how to benefit from Intel hardware using Intel-optimized software.
- Deep Learning Reference Stack v5.0
- PyTorch 1.3.0
We retrieved data from the SDSS Data Release 7 and selected the galaxy classification from the Galaxy Zoo project which provides morphological classifications for nearly a million galaxies. More specifically, we used Galaxy Zoo version 1, which classifies galaxies into six categories shown in Table 1, plus a combined spiral category. Two more classes were added to represent the debiased votes in the elliptical and spiral categories, for a total of nine classes.
We performed image preprocessing to train a ResNet-50 Convolutional Neural Network architecture (part of the PyTorch framework) with the Deep Learning Reference Stack.
Elliptical galaxy
Clockwise Spiral galaxy
Anticlockwise Spiral galaxy
Edge-on (other) galaxy
Star, Don’t Know
Merger
Combined Spiral Category
Debiased votes in Elliptical
Debiased votes in Spiral
Table 1. Galaxy classification features used to train the model
Image Preprocessing and Training Configuration
The input to the model consists of 480x480 JPG images, each with the principal object to analyze centered in the image. We applied additional preprocessing and data augmentation to introduce more variation into the training process:
- Center cropped to 224x224
- Resized to 45x45
- Random Rotation from 0 to 360°
- Random Horizontal and Vertical Flipping (50%)
- RGB color normalization by mean and standard deviation
Figure 1 depicts the typical convolutional neural network process to scan an image and extract features from it. Convolutional filters produce feature maps, which pooling (subsampling) operations then reduce in size. The process ends with a fully connected layer that produces the desired classification result.
Figure 1. Typical Convolutional Neural Network Structure
ResNet-50 (Figure 2) was selected because it offers a good tradeoff between accuracy and inference time for image recognition problems (Canziani et al.). For this use case, we kept the model's default layer parameters and trained for a fixed five epochs with a batch size of 96 images.
Figure 2. ResNet-50 Layer Block Example (image source)
Use case architecture
The Deep Learning Reference Stack trains and serves the image classifier using PyTorch. To train the model, we imported the dataset (images from the Sloan Digital Sky Survey and a CSV file from the Galaxy Zoo project), transformed the data for our AI model, and then trained the image recognition model on the transformed data and images, as shown in the following diagram.
Figure 3. Training Architecture
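The CSV import step might look like the sketch below, which maps each object ID to its nine class values. The column names are an assumption based on the published Galaxy Zoo 1 table and should be adjusted to the CSV file actually used; `load_labels` is a hypothetical helper, and the sample row is made up.

```python
import csv
import io

# Assumption: column names follow the published Galaxy Zoo 1 table;
# adjust them to match the CSV file actually used for training.
CLASS_COLUMNS = [
    "P_EL", "P_CW", "P_ACW", "P_EDGE", "P_DK", "P_MG",
    "P_CS", "P_EL_DEBIASED", "P_CS_DEBIASED",
]


def load_labels(csv_file):
    """Map each object ID to its list of nine vote fractions."""
    labels = {}
    for row in csv.DictReader(csv_file):
        labels[row["OBJID"]] = [float(row[col]) for col in CLASS_COLUMNS]
    return labels


# Made-up single-row sample, for illustration only.
sample_csv = io.StringIO(
    "OBJID," + ",".join(CLASS_COLUMNS) + "\n"
    "1001,0.61,0.03,0.02,0.25,0.05,0.04,0.30,0.55,0.36\n"
)
labels = load_labels(sample_csv)
print(labels["1001"])
```

Each list can then serve as the training target for the image with the matching object ID.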
After the model is trained and saved as a PyTorch (.pt) file, it can be loaded to perform inference on images containing galaxies. The result is a set of scores between 0 and 1 indicating how likely the galaxy is to belong to each of the given categories. The closer a score is to 1, the more likely the galaxy belongs to that category. For example, a score of 0.91 in the elliptical category indicates that the object analyzed by the model very likely has an elliptical morphology.
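To make the interpretation concrete, the sketch below picks the highest-scoring category from a set of scores. The helper `most_likely_category` is hypothetical, and the score values are made up for illustration; they are not output from the trained model.

```python
def most_likely_category(scores):
    """Return the (category, score) pair with the highest score."""
    category = max(scores, key=scores.get)
    return category, scores[category]


# Category names mirror the nine classes in the article; the values
# are invented for this example.
example_scores = {
    "elliptical": 0.91,
    "clockwise spiral": 0.01,
    "anticlockwise spiral": 0.02,
    "edge-on": 0.03,
    "star/don't know": 0.01,
    "merger": 0.02,
    "combined spiral": 0.06,
    "debiased elliptical": 0.88,
    "debiased spiral": 0.09,
}

print(most_likely_category(example_scores))  # → ('elliptical', 0.91)
```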
The inference architecture is shown in the following diagram.
Figure 4. Inference Architecture
Running Galaxy Recognition with the Deep Learning Reference Stack
This section describes how to run the Galaxy recognition use case.
- Intel® Xeon® Processor with Intel® Advanced Vector Extensions 512 (Intel® AVX-512) enabled
- Familiarity with Docker*
- Familiarity with git and GitHub*
- Familiarity with Python*
- Pull the Deep Learning Reference Stack image to the local machine:
docker pull clearlinux/stacks-pytorch-mkl
- Run the container:
docker run --ipc=host -id --name <container_name> clearlinux/stacks-pytorch-mkl /bin/bash
- Install prerequisites inside the container:
- Save Galaxy images from SDSS and the weighted CSV file for training the model in the container:
docker cp <image.jpg> <container_name>:<training path>
- You may also need to install the following Python packages:
pip install numpy pandas pillow scikit-image
- Clone the use case code from the GitHub repository:
git clone https://github.com/intel/stacks-usecase.git
cd stacks-usecase/galaxy_recognition/
- Train the model:
- Run inference with trained model:
python inference.py <id> <image.jpg>
<id> is an integer used to reference the result in an output CSV file.
<image.jpg> is the image you want to run inference on, which must be in JPG format.
Following these steps, you will have a trained model capable of recognizing galaxies with the Deep Learning Reference Stack.
The Galaxy Recognition use case is an example of how the Deep Learning Reference Stack can be implemented without significant changes to solve an interesting problem for the science community. Using a simple script, it is possible to preprocess images with galaxies and train a Convolutional Neural Network such as ResNet-50 to create a model that identifies the galaxy category of the analyzed object. Other solutions can be implemented as well, such as your own models or a different framework like TensorFlow*. The code for training and inference can be used as a base to solve other problems that involve image recognition. Finally, it is worth mentioning that the Deep Learning Reference Stack is a highly optimized solution for Deep Learning on Intel® architecture. It can be complemented with other stacks such as the Database Reference Stack for a more robust solution.
We welcome your ideas for further enhancements.
For reference and troubleshooting, please refer to the stacks-usecase repository.