GSoC 2020 Raahul Singh
Just reading the title of the project was enough for me. A desire to work on problems like this is what got me into computer science and machine learning in the first place: to be able to work on problems in various fields of science and understand them in a way that was not even possible a few decades ago. There is tremendous potential for applying data-driven solutions to solar physics and I wish to do my part in it. I am two months short of a full year of professional experience in machine learning and am well acquainted with SunPy. This gives me confidence that I will be able to finish this project successfully and on time.
- The first challenge in creating a Search Events Object is matching the Active Regions (ARs) in the Sunspotter data set with the Active Regions in various solar data catalogues, like HEK and HELIO. We only need to worry about matching Sunspotter ARs to a catalogue that has NOAA AR numbers. Once we have the NOAA AR numbers for the ARs, we can access various other catalogues easily.
- Matching catalogues (joining tables) when the fields do not have exactly the same values (i.e., allowing the user to specify what counts as a match) would be the next task. For example, for any given observation date, the HEK data for the ARs usually does not match the Sunspotter data exactly. This is evident from the lack of proper overlap in the various full disk plots that I have made in this notebook.
- ARs will be matched across the Sunspotter data and the HEK by first identifying common fields and then matching the rows with the algorithm used by the Tool for OPerations on Catalogues And Tables (TOPCAT), as explained in the TOPCAT match algorithm reference.
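To make the idea of "what counts as a match" concrete, here is a minimal sketch of tolerance-based matching. It is not TOPCAT's algorithm: it is a simple greedy nearest-neighbour pairing, and the field names (`id`, `date`, `x`, `y`), the 100 arcsec tolerance and the rows themselves are assumptions made up for illustration.

```python
import math


def match_regions(sunspotter, hek, max_sep=100.0):
    """Pair each Sunspotter AR with the nearest unclaimed HEK AR.

    Both inputs are lists of dicts with the hypothetical keys
    'id', 'date', 'x' and 'y' (positions in arcsec). A pair counts
    as a match only if the dates agree and the separation is at
    most max_sep.
    """
    matches = {}
    used = set()
    for spot in sunspotter:
        best_id, best_sep = None, max_sep
        for region in hek:
            if region["id"] in used or region["date"] != spot["date"]:
                continue
            sep = math.hypot(spot["x"] - region["x"], spot["y"] - region["y"])
            if sep <= best_sep:
                best_id, best_sep = region["id"], sep
        if best_id is not None:
            used.add(best_id)
            matches[spot["id"]] = best_id
    return matches


# Made-up rows, purely for illustration.
sunspotter_rows = [
    {"id": "s1", "date": "2011-01-01", "x": 100.0, "y": -200.0},
    {"id": "s2", "date": "2011-01-01", "x": -500.0, "y": 300.0},
]
hek_rows = [
    {"id": "noaa_a", "date": "2011-01-01", "x": 120.0, "y": -190.0},
    {"id": "noaa_b", "date": "2011-01-02", "x": -500.0, "y": 300.0},
]
matched = match_regions(sunspotter_rows, hek_rows)
```

Here `s1` is paired with `noaa_a`, while `s2` stays unmatched because the only remaining HEK row is from a different date. A real implementation would probably work on sky coordinates and let the user choose the tolerance.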
- Next, we extend the `hek2vso` client and work towards better integrating it with FIDO. The idea is to create a single interface for getting the metadata and the various types of files associated with an observation. The long-term idea is a unified interface where we can get data and meta-information about any event from a given observation date. One outcome would be that the user can get event-specific information from HEK or other sources and use it in conjunction with the downloaded HMI and AIA data, as is done in the SunPy Gallery example on using AIA and HMI data together.
Another aspect of this Search Events Object would be a dictionary that maps names between different catalogues so that the same `features.attribute` can be used on different catalogue searches (e.g., a flare is translated as FL in HEK and as the different flare tables on HELIO).
The object would be made subscriptable to facilitate easy access.
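A minimal sketch of such a subscriptable name map might look like the following. The class name `EventNameMap` and the HELIO table names are hypothetical placeholders; only the HEK code `FL` comes from the text above.

```python
class EventNameMap:
    """Translate a common event name into catalogue-specific names."""

    def __init__(self, table):
        # table: {common_name: {catalogue: catalogue_specific_name}}
        self._table = table

    def __getitem__(self, key):
        # obj["flare"] returns every translation;
        # obj["flare", "HEK"] returns the one for a single catalogue.
        if isinstance(key, tuple):
            name, catalogue = key
            return self._table[name][catalogue]
        return dict(self._table[key])


names = EventNameMap({
    # The HELIO entries are placeholders, not real table names.
    "flare": {"HEK": "FL", "HELIO": ["flare_table_a", "flare_table_b"]},
})
hek_code = names["flare", "HEK"]
all_flare_names = names["flare"]
```

Supporting both `names["flare"]` and `names["flare", "HEK"]` is one possible design; the final interface would be settled with the mentors.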
For the forecasting, the biggest challenge would be the data preprocessing and making it ready to be fed into the learning pipeline. For any machine learning project, the preprocessing plays a major role in the ability of the learning algorithm to learn from the data set. In the Sunspotter data set, there is a significant bias. The following is a list of the percentage of positive samples per class (not mutually exclusive):
Forecasting Parameter | Percentage of Positive samples |
---|---|
No-flare | 24.00% |
C1flr12hr | 11.37% |
C1flr24hr | 14.40% |
C5flr12hr | 8.12% |
C5flr24hr | 8.97% |
M1flr12hr | 7.22% |
M1flr24hr | 7.55% |
M5flr12hr | 7.56% |
M5flr24hr | 7.66% |
- To deal with this problem, I propose using a method similar to the one described in the paper Predicting Solar Flares Using a Novel Deep Convolutional Neural Network (Xuebao Li, Yanfang Zheng, et al. 2020).
- For the sake of conciseness, the main ideas are:
- The number of M-level magnetogram samples is far smaller than that of No-flare/C-level magnetogram samples, which is consistent with the fact that most ARs do not yield major flares in any given 24 hr period. This results in a serious class-imbalance issue, which is a major problem in machine learning.
- To deal with this, ARs are first categorized into three levels (i.e., No-flare, C, M). “Level = M” indicates that an AR yields at least one M-level flare; “Level = C” indicates that an AR yields at least one C-level flare but no M-level flares; “Level = No-flare” indicates that an AR only yields microflares (weaker than C1.0 flares).
- We shall construct about 10 separate cross-validation data sets by shuffle-and-split cross-validation based on AR segregation. AR segregation means that, to balance out the classes, I will make sure that each of the ten cross-validation splits has an almost equal number of C, M and No-flare producing ARs. First, we randomly shuffle the AR numbers within each of the No-flare/C/M levels and then split the AR numbers at a ratio of around 80%:20%, corresponding to training and testing data respectively.
- The advantage of this method is that in each of the 10 data sets, not only do the samples in the testing data set not overlap with those in the training data set but also the ARs in the testing data set are disjoint from those in the training data set. We train and evaluate our model on these 10 separate training and testing data sets. We adopt a weighted cross-entropy loss function.
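The shuffle-and-split step can be sketched as below. This is an illustration under stated assumptions, not the code from the paper's repository: the function name `ar_shuffle_splits`, the per-level 20% test fraction and the AR numbers are made up, and the sketch only guarantees that ARs are disjoint between training and testing sets.

```python
import random


def ar_shuffle_splits(ar_levels, n_splits=10, test_fraction=0.2, seed=0):
    """Yield (train_ars, test_ars) sets, split separately per flare level.

    ar_levels maps an AR number to its level ('No-flare', 'C' or 'M').
    """
    rng = random.Random(seed)
    by_level = {}
    for ar, level in ar_levels.items():
        by_level.setdefault(level, []).append(ar)

    for _ in range(n_splits):
        train, test = set(), set()
        for level in sorted(by_level):
            ars = sorted(by_level[level])
            rng.shuffle(ars)
            # Keep at least one AR of every level in the test set.
            n_test = max(1, round(len(ars) * test_fraction))
            test.update(ars[:n_test])
            train.update(ars[n_test:])
        yield train, test


# Made-up AR numbers, purely for illustration.
levels = {ar: "No-flare" for ar in range(1, 11)}
levels.update({ar: "C" for ar in range(11, 21)})
levels.update({ar: "M" for ar in range(21, 26)})
splits = list(ar_shuffle_splits(levels))
```

All samples of an AR would then be assigned to whichever side its AR number landed on, so no AR contributes to both training and testing.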
I'll be modifying the code from the repository mentioned in the original paper to produce the 10 Cross-Validation splits. These modifications would mostly be for making the preprocessing step into a SunPy compliant class, which can be merged into the main repository.
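As a rough illustration of the weighted cross-entropy loss, the sketch below weights each class inversely to its frequency. The inverse-frequency scheme and both function names are my assumptions; the weights used in the paper may be chosen differently.

```python
import math


def inverse_frequency_weights(positive_fraction):
    """Return (negative, positive) weights inversely proportional to frequency."""
    negative_fraction = 1.0 - positive_fraction
    return 0.5 / negative_fraction, 0.5 / positive_fraction


def weighted_binary_cross_entropy(y_true, y_prob, weights, eps=1e-12):
    """Mean weighted cross-entropy for binary labels and predicted probabilities."""
    w_neg, w_pos = weights
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# M1flr24hr has about 7.55% positive samples in the table above.
weights = inverse_frequency_weights(0.0755)
loss_missed_flare = weighted_binary_cross_entropy([1], [0.1], weights)
loss_false_alarm = weighted_binary_cross_entropy([0], [0.9], weights)
```

With these weights a confidently missed flare costs far more than an equally confident false alarm, which is the intended effect on an imbalanced data set.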
In any data-driven problem, the challenge is to get the best possible performance with minimum resource consumption. In addition to this, the decision making (for example, classification) done by the machine learning algorithm should also be explainable. I, therefore, propose to use an Autoencoder to distil the information in the images and use this encoding in further forecasting algorithms.
- An autoencoder is an unsupervised artificial neural network that learns how to efficiently compress and encode data, and then learns how to reconstruct the data from the reduced encoded representation into a representation that is as close to the original input as possible. An autoencoder, by design, reduces data dimensions by learning how to ignore the noise in the data.
Here are some awesome resources on Autoencoders:
-
In addition to being used with a neural network, this lower-dimensional encoding can also be fed to simpler machine learning models. In my experience, the complexity and the representation of data play a very important role in any learning task. If simpler models give results comparable to computationally heavy black box algorithms like neural networks, we should prefer them as they are easier to debug and explain.
The reasons why I would prefer reducing the dimensionality of the data using autoencoders over creating more data with algorithms like GANs to solve the class imbalance problem are:
- There would be no way to verify the effect of the AR metadata on the final classification. If we mix real observations with artificially produced data, we would still only have the observed metadata (from sources like HEK) for the real data. This would restrict us to simply using images for our forecasting.
- We would only produce more image data. This restricts us to image only algorithms, as the data set would remain imbalanced for all other algorithms.
- GANs are amongst the most computationally expensive algorithms and are not always stable enough to get diverse data production. There are various other problems like mode collapse, etc. which may cause major problems, quite unrelated to our main task of flare forecasting.
Different algorithms will consume a different amount of resources in terms of memory and training time. I shall be running all the algorithms locally on my machine equipped with an NVIDIA 1050Ti GPU. If need be, I'll migrate to Google Colab.
All the code written in this project shall be rigorously tested, following all of SunPy's testing standards and other general good practices. For the machine learning model, I shall be following industry standards for testing and debugging. I shall also take inspiration from popular blogs with some nice code references for unit testing and logging.
I plan on implementing various algorithms in order of increasing model complexity. These models will map various inputs to the following forecast parameters.
Forecast Parameter | Description |
---|---|
c1flr12hr | At least one C1.0 or greater flare within 12 hr after the observation. |
c1flr24hr | At least one C1.0 or greater flare within 24 hr after the observation. |
c5flr12hr | At least one C5.0 or greater flare within 12 hr after the observation. |
c5flr24hr | At least one C5.0 or greater flare within 24 hr after the observation. |
m1flr12hr | At least one M1.0 or greater flare within 12 hr after the observation. |
m1flr24hr | At least one M1.0 or greater flare within 24 hr after the observation. |
m5flr12hr | At least one M5.0 or greater flare within 12 hr after the observation. |
m5flr24hr | At least one M5.0 or greater flare within 24 hr after the observation. |
The accuracy of the model in predicting these parameters would be its accuracy in forecasting the solar weather.
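For reference, the eight forecast parameters in the table can be derived from a list of flares roughly as follows. This is only a sketch: the function names, the use of one time stamp per flare (which could be the start or the peak time) and the example flare are assumptions.

```python
from datetime import datetime, timedelta

# Thresholds as (GOES class letter, multiplier), following the table above.
THRESHOLDS = {"c1": ("C", 1.0), "c5": ("C", 5.0), "m1": ("M", 1.0), "m5": ("M", 5.0)}
CLASS_ORDER = "ABCMX"


def is_at_least(goes_class, letter, multiplier):
    """True if a GOES class string such as 'M2.3' is at or above the threshold."""
    rank, value = CLASS_ORDER.index(goes_class[0]), float(goes_class[1:])
    threshold_rank = CLASS_ORDER.index(letter)
    return rank > threshold_rank or (rank == threshold_rank and value >= multiplier)


def forecast_labels(observation_time, flares):
    """Build the eight binary labels from (flare_time, goes_class) pairs."""
    labels = {}
    for name, (letter, multiplier) in THRESHOLDS.items():
        for hours in (12, 24):
            window_end = observation_time + timedelta(hours=hours)
            labels[f"{name}flr{hours}hr"] = any(
                observation_time < t <= window_end and is_at_least(c, letter, multiplier)
                for t, c in flares
            )
    return labels


# A made-up M2.3 flare 18 hours after a made-up observation time.
obs = datetime(2011, 1, 1, 0, 0)
example = forecast_labels(obs, [(datetime(2011, 1, 1, 18, 0), "M2.3")])
```

In this example only `c1flr24hr`, `c5flr24hr` and `m1flr24hr` are true, because the flare falls outside the 12 hr window and is weaker than M5.0.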
This project will have the following deliverables:
- A fully integrated Search Events Object, supported with rigorous testing. This will have all the features described above, and more, which will be decided on further consultation with the mentors during the community bonding period.
- Multiple notebooks for the SunPy Examples gallery that highlight each algorithm implemented along with an interpretation of the results.
- The best performing trained model, properly documented and tested, along with instructions on how to use it. It will inherit from a separate `ML Algorithm` abstract object so that more machine learning related work can be easily integrated in the future.
- A separate `Data Preprocessing` object that will be used in this project and could be extended for future ML work.
- Instructions on how to retrain the model if necessary.
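One way the `ML Algorithm` abstract object could look is sketched below. The class names, the `fit`/`predict` interface and the toy baseline are hypothetical and only illustrate the inheritance idea.

```python
from abc import ABC, abstractmethod


class MLAlgorithm(ABC):
    """Hypothetical base class every forecasting model would inherit from."""

    @abstractmethod
    def fit(self, features, labels):
        """Train on a list of feature vectors and binary labels."""

    @abstractmethod
    def predict(self, features):
        """Return one prediction per feature vector."""


class MajorityClassBaseline(MLAlgorithm):
    """Trivial model that always predicts the most common training label."""

    def fit(self, features, labels):
        self.majority_ = max(set(labels), key=labels.count)
        return self

    def predict(self, features):
        return [self.majority_ for _ in features]


baseline = MajorityClassBaseline().fit([[0.1], [0.4], [0.9]], [0, 0, 1])
predictions = baseline.predict([[0.2], [0.8]])
```

Any later model would only need to implement the same two methods to plug into the rest of the pipeline.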
- The basic layout for implementing the Search Events Object shall be designed at this time. I would take into account the work done in the FIDO project and work towards implementing the Search Events Object as per our requirements. The efforts on this project will also complement the FIDO metadata project with tools and use cases.
- I shall spend this time exploring the data set, discussing possible modifications to the architectures and hyperparameter tuning with my mentors, and continuing my contributions to SunPy.
- I also plan on familiarising myself with the SMART algorithm and its outputs from the ground up using the IDL implementation. This will help me understand the various parameters and their generation.
- Since Google has provided an extra week this year for community bonding, I shall use this time to recreate the ELO rating in Python and compare it with other ratings (e.g., Glicko's, Bradley-Terry's). Glicko is an implementation I found on GitHub that I can fork and adapt to our data set.
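As a starting point, the standard Elo update for a single pairwise comparison can be sketched as follows. The K-factor of 32 and the initial rating of 1400 are illustrative assumptions, not necessarily the values used by Sunspotter.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, a_wins, k=32.0):
    """Return the updated ratings after one pairwise comparison."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


# Two images start level; image A is judged the more complex one.
new_a, new_b = update_elo(1400.0, 1400.0, a_wins=True)
```

Here the "winner" is the image judged more complex, so repeated comparisons push complex regions towards higher ratings.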
- As per the official timeline, this week will be spent visualising different types of data: the magnetograms, and univariate and multivariate analysis of the SMART detection properties with respect to both flare generation and the ELO complexity score. I shall be making all the plots in multiple notebooks for the SunPy examples gallery.
- The ELO complexity score and its variation with the production of flares will be analysed statistically. It is believed that the more complex an active region, the more likely it is to produce flares. The work done this week will help test this belief.
- Further, I would begin working on the Search Events Object as described above.
- Having already tinkered with the data set, I have created a few basic plotting functions and a few helper functions to query the HEK database using HEKClient. Here are a few random examples. The red squares have been plotted using the Sunspotter data and the blue rectangles have been plotted from the queried HEK data. All functions are in the above-mentioned link. The red squares have been overplotted on purpose as the Sunspotter data set lacks information about the shape of the bounding boxes.
- This week shall be used to complete most of the Search Events Object. Following this, I wish to give testing at least four to five days to make it completely merge-ready.
- Having completed all the previous tasks, we are now ready to tackle the forecasting problem head-on. The third week shall be used in completing the data preprocessing as described in the previous section. This shall take the first half of the third week.
- Next, I will implement various models to map the complexity scores directly to the flare observations. We shall not be considering the images as of yet.
- These models shall serve as benchmarks for comparing against deep learning models.
All of the following will be implemented using Scikit-Learn and other ML libraries:
- Random Forest
- SVM
- XGBoost
- These are all standard models. This is the part where I experiment with different linear algebra models mentioned in the problem statement. I do not expect the best result from any of them but they will help in determining the next course of action we take as we apply deep learning methods in the later weeks.
- If need be, other ratings (e.g., Glicko's, Bradley-Terry's) will be used and the predictions on them will be compared against the ELO trained model predictions.
- All the implementations will be accompanied by an analysis blog post on the nature of the algorithm, an analysis of the results and possible deductions about the performance metrics.
- This will be a buffer week to ensure:
- The Search Events Object is implemented, fully documented and well tested.
- We have all the Exploratory Data analysis plots for SunPy Examples Gallery.
- The basic Machine Learning model is up and running.
- Week 5 shall be used to reimplement the best performing algorithm from weeks 3 and 4, but this time taking into account the SMART properties along with the ELO complexity score.
- The results obtained here will likely be less accurate, given the increase in input complexity.
- To combat this, dimensionality reduction algorithms like Principal Component Analysis shall be used to find the most important combinations of features.
- The best performing algorithm shall be trained and tested on this reduced data set. It is expected to give a boost in the prediction accuracy.
- At this point, we move into the domain of Deep Learning.
- For week 6, I shall be implementing a Deep Convolutional Neural Network based on the paper Deep Learning-Based Solar Flare Forecasting Model. I. Results for Line-of-sight Magnetograms (Huang et al.).
- We begin using the images of the Active Regions obtained from Sunspotter and map them to the flare observation labels.
- The reason for implementing a paper is that I get something that is relatively known to work. I can build on top of it, rather than start from scratch.
- With 210692 images, the network may take a long time to train and we will need to train again every time we tweak the hyperparameters of the network, so I am allotting two weeks for this initial network implementation.
This shall be implemented using either the PyTorch or TensorFlow2.0 libraries, which shall be decided after discussing with the mentors.
- Whereas all the SMART parameters can be treated as sensor inputs, the complexity score represents more of a belief or a confidence parameter.
- The complexity score, when scaled to a range of [0,1], can be considered a probability score for the Active Region to produce a flare.
- For week 8, I shall take the best performing CNN from the previous weeks and add a major augmentation: I shall multiply the output of the biggest hidden layer by the scaled ELO score corresponding to the input image. This shall give the network information about the complexity and will allow the network to train accordingly.
- Comparing the results with the un-augmented network, we shall have further evidence of whether the ELO complexity score correlates with flare production or not.
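The scaling and multiplication described above can be illustrated without any deep learning framework. Min-max scaling is my assumption for how the score is mapped to [0, 1], and the ratings and activations are made up.

```python
def min_max_scale(scores):
    """Scale scores linearly to the range [0, 1]."""
    low, high = min(scores), max(scores)
    return [(s - low) / (high - low) for s in scores]


def gate_activations(hidden, scaled_score):
    """Multiply every activation of one hidden layer by the scaled score."""
    return [h * scaled_score for h in hidden]


elo_scores = [1200.0, 1400.0, 1600.0]  # made-up ratings
scaled = min_max_scale(elo_scores)
gated = gate_activations([0.5, -1.0, 2.0], scaled[1])
```

One consequence worth checking during the experiments: with min-max scaling the least complex region gets a score of 0, which would zero the whole layer for that image, so an offset or a different scaling may be preferable.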
This shall be implemented using either the PyTorch or TensorFlow2.0 libraries, which shall be decided after discussing with the mentors.
-
Here I plan on implementing an original idea for a multichannel neural network, which would have the ability to take both the Sunspotter SMART detection values along with the corresponding images. I plan on making these different types of inputs compatible by first training an AutoEncoder network on the Active Region images to learn an effective lower-dimensional encoding for the images. This shall be concatenated and re-normalised with the processed SMART detection values and the complexity score to make the final feed-forward neural network that will learn the mapping.
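The concatenate-and-re-normalise step could look roughly like this. I am assuming that "re-normalised" means standardising each feature to zero mean and unit variance across the data set; the function name and all values are illustrative.

```python
from statistics import mean, pstdev


def build_inputs(encodings, smart_values, complexity_scores):
    """Concatenate per-image features, then standardise each column."""
    rows = [
        list(enc) + list(smart) + [score]
        for enc, smart, score in zip(encodings, smart_values, complexity_scores)
    ]
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; divide by 1.0 so it maps to 0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


inputs = build_inputs(
    encodings=[[0.1, 0.9], [0.3, 0.7]],  # made-up autoencoder outputs
    smart_values=[[1500.0], [2500.0]],   # made-up SMART property
    complexity_scores=[0.2, 0.8],
)
```

Standardising after concatenation keeps features with large physical units, such as the SMART properties, from dominating the encoded image features.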
- The autoencoder will be like a neural network version of SMART which, when given a particular image of an AR, will characterise it with some properties. It will give us a vector of distilled information from each image but, unlike SMART, it is a black box: we will not know what the values in that vector represent.
- I shall also retrain the best performing non-Deep Learning model from previous weeks to see if we can get comparable results from less computationally taxing algorithms.
This shall be rather uncharted territory and I will give a full two weeks to implement this.
- The last two weeks of the project shall be used to summarise the results of all the various experiments based on the different algorithms.
- After selecting the best performing model, its performance shall be tested on SDO/HMI data.
- An extensive notebook shall be written detailing the use of the model and possible ways to tweak the hyperparameters.
- If time permits, as a side quest, I shall implement a neural network that directly maps the Active Region images with the ELO complexity rating. This shall help in automating the complexity prediction for magnetograms in the future.
- The final week shall be used to add final touches to the deliverables.
- Time zone: UTC+05:30
- GitHub handle: Raahul-Singh
- Riot: @raahulsingh:matrix.org
- University: Indian Institute of Information Technology, Sri City
- Major: Computer Science
- Current Year: 2nd year
- Programming Languages: Python, C++, C, Java (basic knowledge)
- Contributions to SunPy:
- As of writing this proposal, I have 8 merged and 5 WIP PRs on the SunPy repo. Some prominent ones are:
- Significant reduction of the download time for the data set used in this project through the HelioViewer API.
- Rectangle coordinate parser to support multiple ways of specifying rectangular regions of interest.
- Integrating parfive==1.1rc2 with SunPy.
- Implementing Limb Darkening correction.
- Informing user when and what coordinate information may be missing when making a map.
I have opened 7 issues on the SunPy repository and have a cumulative total of 25 contributions to SunPy.
- Contributions to other repos:
- Parfive
- And other contributions to EinsteinPy and ChiantiPy.
- I will be contributing to SunPy throughout the selection process and beyond.
No, this is my first time participating in GSoC.
No, I am fully focused on this project.
I don't have any other internship or work during this summer and have no plans for any vacations either.
I can work full time on the project and can give ~35-40 hours per week and more if required.
Yes, I am eligible to receive payments from Google.
- Deep Learning-Based Solar Flare Forecasting Model. I. Results for Line-of-sight Magnetograms. Huang et al.
- Solar flare prediction using advanced feature extraction, machine learning and feature selection. Ahmed OW, Qahwaji RSR, Colak T, Higgins PA, Gallagher PT and Bloomfield DS (2013). Solar Physics 283(1): 157-175.
- Predicting Solar Flares Using a Novel Deep Convolutional Neural Network. Xuebao Li, Yanfang Zheng, et al. 2020.