
Organisation: OpenAstronomy

Sub-Organization: SunPy

Project: Remote Data in Sunpy

Student Information

University Information

Open Source and Development Experience

Before contributing to SunPy I did not have much open-source experience, but I do have experience working on team projects, especially in Python.

  • Made a shopping portal in Python using Flask, similar to OLX, where you can buy or sell products. Link to Project
  • Made a proxy server and a client-server model in Python using HTTP libraries. Link1 Link2
  • Made a Bomberman game in Python using OOP principles. Link to Project
  • Made a 2-D Pacman Killer game using OpenGL. Link to Project
  • Made a 3-D Legend of Zelda game, Wind Walker, using OpenGL. Link to Repo
  • Made an Extreme Tic-Tac-Toe bot in Python. Link to Bot
  • Built a state-restoring quiz game in Ruby on Rails, then modified that project to run the Cache-in event at IIIT-H Felicity Buzz, where a number of people registered on it. Link to Project

Development Experience: Mobile App Developer Intern at Statwig, T-Hub, IIIT Hyderabad, where I made the Project Toad app, similar to Google Calendar. I am also an Amazon Alexa developer and a Google Assistant developer.

GitHub Handle : blackeye

Operating System Experience: Currently, I use Ubuntu. I also use Fedora, Linux Mint, and Windows.

Apart from the above projects, I am also proficient in C, C++, and JavaScript.

Pull Request & Issue

Project Proposal

Project: Remote Data in SunPy

Mentors : Stuart Mumford, Will Barnes, David Pérez-Suárez

Abstract

SunPy aims to support data analysis in the field of solar physics by providing tools and functions that minimize the effort users spend on their solar-physics tasks. The sunpy package needs access to data files on remote (HTTP) servers to carry out that analysis. Since SunPy has no control over the data on those servers, and files on the servers may be replaced with different files under the same name, a module is needed that ensures remote data is downloaded and cached before it is used on the client system. This project provides a way to validate that the retrieved file has the expected hash and also provides ways for users to override this hash, i.e. re-download the data if they are aware of changes on the remote server.

The project will contain the remote_data_manager class, which will give users a mechanism to download data from remote servers, version that data, cache it, and fall back to multiple mirrors in the download function. Users will access remote_data_manager through its API. Apart from the functions mentioned above, the API will provide several options, such as getting the latest version of a file (the newest file in the cache by timestamp) or getting a file by its hash value.

Motivation

Functions in sunpy will need data files that live on a remote server, and as of now there is no proper system for managing remote data, so a system that meets SunPy's remote-data requirements needs to be designed. The following SunPy issues pushed me to design and build the remote data manager.

  1. The download_file function, taken from Astropy and currently used to download data for tutorials, does not support an override mechanism for re-downloading data. It also does not meet the multiple-mirror requirement when data is not present on one server. Motivation from Issue 1809.
  2. Part of the AIA response function varies with time. Right now the instrument info (e.g. reflection coefficients of mirrors, gain of the CCD, quantum efficiency of the CCD) is read straight from the .genx files in SSW. When users create an instance of the response class, they can choose a path to their SSW install and the version of the instrument file they want (the default is 6). As of now, the problem of how to either pull this info down from somewhere or store it (and version it somehow) has not been solved. Whether the calculated response function is up to date depends on the instrument teams keeping the files in SSW up to date, on the user's local SSW install being up to date, and on the user selecting the newest version (since older versions are kept in SSW). This is not a good approach. Motivation from Issue 1897.
  3. Timeout errors occur when downloading data from the remote server exceeds the time limit.

Project Goals

Part 1: Evaluation of the proposed methods and selection of the best approach for storing a local cache of data. This includes an implementation of a basic cache and download system, with tests and documentation.

Part 2: Design of a simple and functional API, working with my mentors and the community to understand and implement the design efficiently. A working prototype of this API, including its tests, also has to be completed.

Deliverables

  1. Cache and download system along with its tests and documentation.
  2. Simple and functional API and its working prototype, with written examples for the gallery showing how to use the functionality.
  3. Developer documentation explaining the functionality. All work should be done so that each pull request can be merged after review and feedback.

Detailed Description

Requirements elicitation

  1. The first requirement is that data is downloaded and cached to $HOME/sunpy/data... when first needed. This folder is created automatically when you build the sunpy project. (Inside the data folder there are currently fts and fits files and a sample_data folder, which also contains fits, txt, and pha files.)
  2. The second requirement is to perform some kind of validation that ensures the data was transferred correctly. This can be done using a cryptographic hash function. Since the project deals with changing data on remote (HTTP) servers, a rigorous version-controlled data-management system, similar in spirit to git, is needed. This requirement will be treated as high priority in this project.
  3. A mechanism by which users can be allowed to re-download data.
  4. The download code supports multiple mirrors.

Design and Implementation

Remote Data Manager

In this project, remote_data_manager is the core class that handles the caching operations, download features, file handling, checksumming, validation, etc. The require, skip_hash_check, replace_file and other helper functions of this class will perform these operations. The require function will work as a decorator and will be used to build an efficient caching mechanism. The skip_hash_check and replace_file functions will be implemented as context managers. The helper functions that I propose for the API include compare_hash, get_cached_urls, delete_cache, and get_file, which will be used to access remote_data_manager.

Caching

We need to cache the downloaded data in an effective manner. We will use the require function of remote_data_manager, which maintains a record of the cache; here require works as a decorator. Code inside sunpy asks for a given file name at a given URL with a given hash. For example, define a function named myfetch() that needs some data:

@remote_data_manager.require(name='file1',
                             urls=('https://server1/file1.fits', 'http://server2/file1.fits'),
                             shasum='1245645343545334')
def myfetch():
    filename = remote_data_manager.get_file('file1')
    return filename

This adds the function name and its files to the cache. When the code is run, the downloader goes out and gets the file, verifies that it matches the provided hash using the compare_hash() function of remote_data_manager, puts it in a folder that probably has the function name in it, and then gives it to the function on request. If a file is already present in the cache, the manager simply returns the path to the file without downloading it again.
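
For illustration, a minimal sketch of how the require decorator could work is given below. This is not the final design; the cache dictionary and the download_file and calculate_hash_func helpers are assumed names for this example only.

import functools

cache = {}  # toy in-memory stand-in for the on-disk JSON cache database

def require(name, urls, shasum):
    '''Hypothetical sketch of the caching decorator.'''
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if name not in cache:
                path = download_file(urls)               # assumed mirror-aware downloader
                if calculate_hash_func(path) != shasum:  # hash validation
                    raise RuntimeError('Hash mismatch for ' + name)
                cache[name] = path                       # record the cached path
            return func(*args, **kwargs)
        return wrapper
    return decorator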

Note: the compare_hash() function will be implemented in the remote_data_manager class.

The cache will be maintained on disk in the form of JSON (JavaScript Object Notation). The JSON below describes what the cache database will look like and how it allows the hash check to be overridden and the file to be replaced.

{
  "function_name": "myfetch",
  "fileaname": "file1",
  "default_hash": "1245645343545343",
  "fetch_hash": "1245645343545343",
  "list_hash": [
    {
      "hash": "1245645343545343",
      "size": "10",
      "created_at": "2018-3-15 06:00:00",
      "modified_at": "2018-3-15 06:15:00",
      "urls": ["https://server1/file1.fits", "https://server2/file1.fits"]
    },
    ".......... other file caches belong to the same name"
  ]
}

Here, fetch_hash is the hash obtained from the get_hash function, or the hash attached to the file in the remote_data_manager class; default_hash is the hash I already know for comparison, i.e. the hardcoded hash. list_hash maintains the list of cache entries for files that share the same key.

  • Note: We also cache older versions of the data. A sketch of loading and saving this JSON database is given below.
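
A minimal sketch of how the cache database could be persisted to disk as JSON; the file location and the function names are assumptions for illustration only.

import json
import os

CACHE_DB = os.path.expanduser('~/sunpy/data/cache.json')  # assumed location of the cache database

def load_cache():
    '''Read the cache database from disk, returning an empty list if it does not exist yet.'''
    if not os.path.exists(CACHE_DB):
        return []
    with open(CACHE_DB) as f:
        return json.load(f)

def save_cache(entries):
    '''Write the list of cache entries back to disk.'''
    with open(CACHE_DB, 'w') as f:
        json.dump(entries, f, indent=2)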

Storage and Download

  • Storage: The file will be saved in the folder $HOME/sunpy/data/… and the name of the file is determined by a hash or by a timestamp (the created_at field in the cache JSON). The modified_at field in the cache JSON keeps each cached filename unique.

    • Filename by hashing: The name of the file will be the hash of the original filename plus the modified time. The modified time is included to handle the situation where the file changed on the remote server and the user re-downloads it.

      import hashlib
      cached_filename = hashlib.md5(('original_filename' + modified_at).encode()).hexdigest()

    • Filename by timestamp: The modified_at field from the cache JSON will be appended to the original name of the file after downloading.

      cached_filename = original_filename + '_' + modified_at

  • Download: For downloading data we use remote_data_manager, which calls a download function, for example download_file, or a new function can be implemented. The require function described above will also take a timeout parameter and a show_progress parameter (the progress bar defaults to true in the Astropy download function). A sketch of the multiple-mirror download is given after this list.

  • To store the cache efficiently, compare_hash is made efficient and the hash values of the two files being compared are stored (memoized) so that they do not need to be recalculated repeatedly.
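
To make the multiple-mirror behaviour concrete, the downloader could try each URL in turn and fall back to the next mirror on failure. This is only a sketch using the standard library; the function name and the default timeout are assumptions.

from urllib.request import urlopen
from urllib.error import URLError

def download_from_mirrors(urls, dest, timeout=30):
    '''Try each mirror in order; return the local path on the first success.'''
    for url in urls:
        try:
            with urlopen(url, timeout=timeout) as response, open(dest, 'wb') as out:
                out.write(response.read())
            return dest
        except URLError:
            continue  # this mirror failed, try the next one
    raise URLError('all mirrors failed for ' + dest)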

Functions in Remote data manager

skip_hash_check
  • A function used while downloading data. It lets the user override the hash check if the data has changed on the remote server or something else has happened.
  • Implemented as a context manager.

Skip hash sum check:

with remote_data_manager.skip_hash_check():
    myfetch()
replace_file
  • A user override to download a different file when the user knows a newer version is available.
  • Implemented as a context manager.

Replace file:

with remote_data_manager.replace_file(name='file1',
                                          shasum='1245645343545343',
                                          url='http://myserver/file1.fits'):
    myfetch()
  • Note: Any override effectively creates a new entry, so we can link it to the original version by a shared key. The filename can act as the shared key between the original version and the overridden version. A sketch of both context managers is given below.
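
A minimal sketch of how the two overrides could be written as context managers with contextlib; the module-level flag and the replacement registry are assumed names for this example.

import contextlib

_skip_hash_check = False  # flag consulted by require while downloading
_replacements = {}        # name -> (shasum, url) overrides

@contextlib.contextmanager
def skip_hash_check():
    '''Temporarily disable hash validation while downloading.'''
    global _skip_hash_check
    _skip_hash_check = True
    try:
        yield
    finally:
        _skip_hash_check = False

@contextlib.contextmanager
def replace_file(name, shasum, url):
    '''Temporarily point a cached name at a different remote file.'''
    _replacements[name] = (shasum, url)
    try:
        yield
    finally:
        _replacements.pop(name, None)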

Functions that support the working prototype of the API

compare_hash(file1, file2)
  • Used to compare the hash values of two files efficiently.

Note: For efficiency, we first check whether the file sizes differ. This saves time when comparing hashes of big files.

def compare_hash(file1, file2):
    # Files of different sizes cannot be identical, so skip hashing them.
    if get_file_size(file1) != get_file_size(file2):
        return 1  # different files
    if remote_data_manager.get_hash(file1) != remote_data_manager.get_hash(file2):
        return 1  # different files
    return 0  # same file
get_cached_urls()
  • It provides the list of URLs in the cache database, used to look up which files are present in the cache.
def get_cached_urls():
    '''Return the list of URLs recorded in the cache database.'''
delete_cache(file)
  • It deletes cached files along with their URLs from the cache database.
def delete_cache(file):
    '''Delete the cached file and its URLs from the cache database.'''
    return
get_file(file)
  • It provides the file name along with the path to the file. An efficient mapping will be used so that the file can be retrieved by hash or by timestamp (see the sketch after the note below).
def get_file(file):
    '''Return the path to the cached file, determined by its hash or timestamp.'''
  • Note: remote_data_manager is a complex class and may require some more supporting functions, e.g. valid_url (implemented in sunpy), get_file_size, get_hash, etc.
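
A sketch of how get_file could pick the right cache entry, either the newest one by timestamp or the one matching a requested hash. It assumes, for illustration only, that the cache database is flattened into a list of entries with filename, hash, modified_at and path keys, and the optional shasum parameter is also an assumption.

def get_file(name, shasum=None):
    '''Return the cached path for name: by hash if given, otherwise the newest entry.'''
    entries = [e for e in load_cache() if e['filename'] == name]
    if shasum is not None:
        entries = [e for e in entries if e['hash'] == shasum]
    if not entries:
        raise KeyError('no cached copy of ' + name)
    newest = max(entries, key=lambda e: e['modified_at'])
    return newest['path']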

Hashing for validation

Library used: hashlib

In Python, hash objects with different digest sizes are available, e.g. sha1(), sha224(), sha256(), sha384(), sha512(), and md5(). We will use sha1() if we need security, because it is sufficient for validating files and the digest is not too big. If security is less of a concern, md5 will be used because it is more memory efficient and faster to compute.

import hashlib

def calculate_hash_func(path, algorithm='sha1', chunk_size=65536):
    # sha1 gives a 160-bit digest; md5 is faster but less secure.
    hasher = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        # Read the file in chunks so large files do not exhaust memory.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            hasher.update(chunk)
    return hasher.hexdigest()
  • Important: Every function which uses remote data will have a hash attached to it to access the remote data.
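
For example, the attached hash could be checked right after download using the helper above; the expected_shasum value and the local file name here are placeholders.

expected_shasum = '1245645343545343'  # hash attached to the data-using function
if calculate_hash_func('file1.fits') != expected_shasum:
    raise ValueError('file1.fits does not match the expected hash; consider skip_hash_check()')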

Note: An efficient implementation of the above design will proceed in the following way:

  1. First, the file is downloaded from the remote server after obtaining the URL of the data source.
  2. Second, the file is cached and the cache database is maintained in JSON format.
  3. Exceptions will be raised for error handling in the following cases (a sketch follows this list):
    1. If the download_file function takes too long and a timeout occurs, an exception is thrown using a try and except statement.
    2. If the URLs given in the urls argument of the require function are invalid, a URLError is raised via urllib2, i.e. urllib2.URLError.
    3. If the client system does not have enough space to cache the downloaded file, an OSError exception (derived from EnvironmentError) is raised.
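
A sketch of how these exceptions could be handled around the download call. socket.timeout, URLError and OSError are standard-library exceptions (urllib2.URLError corresponds to urllib.error.URLError in Python 3); download_from_mirrors is the assumed helper sketched earlier, and the URLs are placeholders.

import socket
from urllib.error import URLError

urls = ['https://server1/file1.fits', 'https://server2/file1.fits']
try:
    path = download_from_mirrors(urls, 'file1.fits', timeout=30)
except socket.timeout:
    # 1. The server took too long to respond.
    raise RuntimeError('download timed out; retry or increase the timeout')
except URLError as err:
    # 2. None of the given URLs could be reached.
    raise RuntimeError('could not reach any mirror: {}'.format(err))
except OSError as err:
    # 3. Not enough space (or another I/O problem) while writing the cached file.
    raise RuntimeError('could not write the cached file: {}'.format(err))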

Testing and documentation

Documentation and writing tests will be done simultaneously along with each function proposed above. So, completion of each function will involve writing code for implementation, writing tests for that feature, and documenting that feature.

Timeline

Time Period Plan
April 23, 2018 - May 13, 2018 (Community Bonding Period)
  • Read documentation and get more familiar with the code base.
  • Discuss with mentors and get a final idea of how to approach the project.
  • Get familiar with urllib, HTTP libraries, JSON, pytest, the Astropy download_file function, the AIA response, context managers, and Fido.
  • Read the code and get more familiar with the caching mechanism, and try to get an idea of what challenges could arise while implementing it. This is important because the caching mechanism should be efficient.
  • Discuss the proposed caching mechanism and the storage of downloaded data with the mentors.
  • Review the design and make a functional diagram of how to implement it.
Evaluation 1 work starts
May 13, 2018 - May 19, 2018 (1 Week)
  • Discuss with mentors and set up the structure of cache database: Structure would be modular and ensuring it is efficient.
  • Implement the function for downloading data.
  • Write documentation and a basic test case to check the download function.
  • Make a Pull Request.
May 20, 2018 - May 26, 2018 (1 Week)
  • Integration with master if no issues.
  • Implement the require function: build the caching mechanism in the remote data manager and all other functions that support it.
  • Integrate it with the download function.
  • Write documentation and tests to check the caching of downloaded data.
  • Make a Pull Request for review.
May 27, 2018 - June 3, 2018 (1 Week)
  • Integration with master if no issues.
  • Implement the get_file and compare_hash functions: a function to get the file name as well as the path to the file will be implemented.
  • Write documentation and basic test cases to check the functions.
  • Make a Pull Request for review.
June 4, 2018 - June 10, 2018 (1 Week)
  • Self-review of documentation and test cases: this is important before moving to the next phase.
  • Fix bugs (if any) that come from review or any other advice given by the mentors.
  • Prepare everything before the start of Evaluation 1.
June 11, 2018 - June 15, 2018 (Evaluation 1/Buffer Period) Evaluation 1 deliverables:
  • Evaluation of the methods proposed and ensuring the best approach for storing a local cache of data.
  • Implementation of a basic cache and download system, including tests and documentation.
  • Continue to work on API.
  1. Evaluation 1 ends
  2. Evaluation 2 work starts
June 15, 2018 - June 27, 2018 (2 Weeks)
  • Guidance from Mentors to implement the API for overriding download data.
  • Implement the skip_hash_check and replace_file functions: these allow the user to re-download data.
  • Write documentation for the functions.
June 27, 2018 - July 02, 2018 (1 Week)
  • Write the test cases for the skip_hash_check and replace_file functions.
  • Make a Pull request for review.
  • Integrate with master if no issues.
  • If needed, attach hashes to the above functions.
  • Self-review the tests and documentation again because they are the base for the 3rd and 4th requirements.
July 03, 2018 - July 08, 2018 (1 Week)
  • Fix bugs found during self-review (if present).
  • Make a Pull request for review.
  • Fix bugs found during the mentor review or address any other advice given by the mentors.
  • Prepare everything before the start of Evaluation 2.
July 09, 2018 - July 13, 2018 (Evaluation 2/Buffer Period) Evaluation 2 deliverables:
  • Design of a simple and functional API.
  • A working prototype of the API supporting re-download of data and multiple mirrors, including tests.
  • Implement the requested changes if still pending.
  1. Evaluation 2 ends
  2. Final Evaluation work starts
July 14, 2018 - July 21, 2018 (1 Week)
  • Implement the get_cached_urls and delete_cache functions.
  • Write documentation and test cases for get_cached_urls and delete_cache.
  • Make a Pull request for review.
  • Integrate with master if no issues.
July 22, 2018 - August 05, 2018 (2 Weeks)
  • Write examples for the gallery showing how to use the functionality: this includes testing the download function by downloading a file from a remote server, then testing the replace or re-download functionality of the remote data manager. A file will be replaced if the data changes on the remote server or if the file is downloaded from a new server.
  • Evaluate get_cached_urls and delete_cache by checking file sizes, and write examples for them.
  • Make a Pull request for review.
August 06, 2018 - August 14, 2018 (Students Submit Code and Evaluations)
  • Clean up code.
  • Improve documentation (developer).
  • Resolve merge conflicts (if any); any pending work will be addressed in this time.
  • All previous pull requests merged after review and feedback.
August 14, 2018 - August 21, 2018 Mentors submit final student evaluations.
August 22 Final results of Google Summer of Code 2018 announced.
  • Important: I will keep working on SunPy issues until 18 April, before the community bonding period starts. I will also be active on Riot, and to keep the mentors updated during the coding period I will write a weekly blog.

Software packages to be used

  1. Language: Python
  2. Libraries and modules: HTTP client libraries, Checksumming and Caches, sunpy.util.progressbar, json, urllib, contextlib.

How I will successfully complete the project

This project interests me a lot and also fits my current skill sets. Also, I have worked on projects which have strict deadlines and high dependencies on other teammates' progress. This makes me confident of completing this project efficiently and smoothly.

Regularity in work and keeping mentors updated on progress are very important for any project, and I will definitely follow these guidelines for my project. I will also seek guidance if I am stuck on a particular problem. I will make pull requests regularly so that the mentors can keep track of my progress. Also, I will try to make the commit messages and documentation clear and concise to help anyone who works with the code in the future.

To get a better start, I will spend time on the project before the coding period starts so that I can hit the ground running as soon as it begins.

Even after the project ends, I will be available if anyone has any questions regarding my code.

Benefits to the Community

  1. Access to data files on remote (HTTP) servers.

  2. Data is available from the cache when there is no internet connection, provided the file is already cached.

  3. Provides a version-control system for remote data so that one version of SunPy always gives the same answer.

GSoC

Have you participated previously in GSoC? When? With which project?

I have not participated in GSoC before. This is the first time that I would be participating in GSoC.

Are you also applying to other projects?

No. This is the only project and SunPy is the only organization that I have applied for.

Commitment

I don't have any other internships or work ( I don't plan on having any ) for the summer. I don't have any plans to go on vacation either.

My classes for the new semester will begin around August 1, but I will still be able to give sufficient time to the project, as the academic load is quite light during the first few weeks of the semester. I can easily spend 40-42 hours per week on the project.

Also, because my summer vacation starts on May 1, I will start working on the project early so that I can try to complete it well before the deadline (around 2-3 weeks before). This will also ensure that any unforeseen and time-consuming challenges are taken care of (there are also buffer periods to handle this).

Also, SunPy is the only organization and this project is the only project that I have applied for.

Eligibility

Yes, I am eligible to receive payments from Google. For any queries, clarifications or further explanations, feel free to contact me at gulshankumar1210iiit@gmail.com.
