GSoC 2018 Yash Jain
Mentors : Stuart Mumford, David Pérez-Suárez, Will Barnes
-
Name: Yash Jain
-
Time Zone: IST (UTC+5:30)
-
Chat handle: yashkgp
-
Github id: yashkgp
-
Email: yashjainjain1704@gmail.com
-
Blog: medium/@yashjainjain1704
-
University: Indian Institute of Technology, Kharagpur
-
Major: Integrated MS Course in Mathematics and Computing
-
Current Year: 2nd year (4th semester ongoing)
-
Expected Graduation cum Post-graduation: 2021
Hello, I am Yash Jain, a second year undergraduate student at IIT Kharagpur, India. I'm pursuing a degree in Mathematics and Computing. I code on my Ubuntu Desktop using vim as I think the availability of a large number of plugins for vim makes its customizable to suit my needs. I am comfortable with a lot of programming languages but I specifically like Python because of its intuitive interface and very broad spectrum of usage from simple scripting to web applications and scientific uses.
Here is a list of all my merged and unmerged contributions to SunPy in a chronological order:
-
Merged
Fixes Documentation changes due to NumPy 1.14 # 2404 : Due to NumPy 1.14 some of the doc tests failed. This PR tried to fix them. -
Closed
Rhessi returns database files of multiple months.#2418 : This PR was a feature request that enabled Rhessi Fido queries to return database files for multiple months. -
Merged
Adds filters for AstropyDeprecationWarning for 0.8 #2476 : This PR explicitly caughtAstropyDepreiationWarning
on the import of representation module and silenced it for Sunpy0.8. -
Merged
Replaces astropy-helper with sphinx-astropy #2494 : This Pr splits outs the sphinx-related functionality from astropy-helpers to a dedicated sphinx-astropy package and reformatsconf.py
to make it more succinct. -
Merged
Adds Test for TRACEMap #2504 : This PR increases the test coverage by adding source test for trace maps. -
Open
Adds Power Spectrum example for a timeseries #2496 : This adds an example that estimates Power Spectrum Density of a TimeSeries Object. -
Open
Adds figure test for Mapcubeanimation #2546 :Updated PR #2356 by adding figure test for Mapcubeanimator to it.
This project deals with creating a entirely new functionality for SunPy named remote_data_manager
, which will provide users a mechanism to download data from remote(HTTP) servers and various functionalities related to it (versioning of data, caching, checksumming etc).
This project is going to be designed and implemented under the assumptions that SunPy has no control over the data on the remote (HTTP) servers, and that files on the servers may be replaced with different files with the same name.
remote_data_manager
will, therefore, have to validate the retrieved files have the expected hashes while providing ways for users to override this hash if they are aware of changes on the remote server and want to purposely replace that file.
Some functionality in SunPy or in affiliated packages needs access to data files on remote (HTTP) servers. As SunPy integrates more and more instruments and functionalities, there is a need to have a robust version control data management system. Some of the issues that are forcing us to design and implement remote data manager are:-
- A part of AIA instrument info is time-dependent and keeps changing regularly. For now, the instrument information like reflection coefficients of mirrors, the gain of CCD, the quantum efficiency of CCD are read directly from the .genx files in SolarSoftWare. Whenever a user creates an instance of the response class, he/she have to choose a path to his/her SSW and install the version of the instrument file that he/she wants. So, the calculated response function is up to date depends on: _ The instrument teams keeping the files in SSW up to date. _ The user's local SSW install being up to date. * The user selecting the newest version (since older versions are also kept in SSW)
This is definitely not a long-term solution.
- Currently SunPy uses AstroPy's
download_file
to download remote data for tutorials like hmi_synoptic_maps.One of the few problems that AstroPy'sdownload_file
possess is that it doesn't version the data it downloads.It doesn't provide an override mechanism to redownload data whenever data changes on the remote server (deliberately due to calibration change etc.). This could be a really useful feature if we are tying SHA hash to our source code.
With these 2 issues stated, I have only scratched the surface, there are possibly many more uses of remote_data_manager
that one can think of.
In a broad sense, remote_data_manger
will have the following functionalities:-
-
remote_data_manger
will download and cache the data to$HOME/sunpy/data/...
when its first needed. -
remote_data_manger
will do some kind of validation that will ensure that the data is transferred correctly (SHA or MD5 hashes). -
remote_data_manger
will provide a mechanism by which users can re-download data whenever data changes on the remote server. -
remote_data_manager
will support multiple mirrors.
The above mentioned functionalities will be achieved by the following functions:-
def require (remote_url, mirror_urls,shahash,filename,timeout,show_progress=True):
"""
Parameters
----------
remote_url : str
The URL of the file to download
mirror_urls : list of str, optional
These mirror urls will be used whenever there's a problem
with the remote url.
show_progress : bool, optional
Whether to display a progress bar during the download
timeout : float,optional
The timeout, in seconds.If None, timeout will be read from
configuration settings.
hash : str
Hash code with which the downloaded file will
be compared with.
filename : str
the local name of the downloaded file.
Raises
------
urllib2.URLError
Whenever there's a problem getting the remote file.
OSError
Whenever there's not enough space to download the file
"""
This function will work as a decorator and will have the following functionalities:-
-
It will accept
remote_url
as an argument and download the file from theremote_url
-
If there is a problem with the
remote_url
, it will use themirror_urls
to download the file. -
It will also cache the downloaded data and add the function's name to the cache.
-
It will verify the downloaded data with the provided hash.
-
If the file is already present in the cache, it will do nothing.
-
The file will be saved in the directory
$HOME/sunpy/data/file_name/…
and with the name of the file determined by the it's hash. -
Database of cache will be maintained in a JSON file.
A glimpse of proposed cache database :
{
“file_name” : “stuartsfile”
"function_name" : "myfunction",
“default _version” : ”1245645343545343”
“current_version” : ”1245645343545343”
“total_size” : “82”
"versions" : [
{
“hash” : “1245645343545343”
“size” : “40”
“time_stamp” : “2017-11-31 23:59:59”
“urls” : [
“https://server1/file1.fits”
“http://server2/file1.fits”
]
}
{
“hash” : “1362155212253213”
“size” : “42”
“time_stamp” : “2017-12-31 23:59:59”
“urls” : [
”http://myserver/file1.fits”
]
}
]
}
This JSON file is made in accordance with Issue #1939
where,
file_name
will act as a shared key between all the versions of a file .
current_version
is the version of the file that will be returned by get
function.
default Version
is the version of the file that is compatible with our code (the file whose SHA hash is hard coded to the source code ).
By default, both current_version
and default_version
will point to the same file
def get(filename):
"""
Parameters
----------
filename : str
Returns
-------
local_path : str
returns the local path to the current_version of the filename,
whose name is determined by it's hash.
"""
This function will search for the filename argument in the cache and will return the local path to the current_version of the filename .
Define a function that needs remote data :
@remote_data_manager.require(remote_url='https://server1/file1.fits',
mirror_urls=['http://server2/file1.fits'],
shasum='1245645343545343',file_name = 'stuartsfile')
def myfunction():
filename = remote_data_manager.get('stuartsfile')
This will add the function's name to the cache, when the code is run the downloader goes out and gets the file, verifies that it matches the provided hash and it puts it in the folder $HOME/sunpy/data/stuartsfile/…
, and then gives it to the function on request.
@contextlib.contextmanager
def skip_hash_check(timeout,show_progress=True):
"""
Parameters
----------
show_progress : bool, optional
Whether to display a progress bar during the download
timeout : float ,optional
The timeout, in seconds.If None, timeout will be read from
configuration settings.
"""
This function allows users to download the same file from the same server even if the hash has changed(i.e. the file has changed on the remote host). This function is a context manager, and hence will be used with awith
-as
statement.
with remote_data_manager.skip_hash_check():
myfunction()
This will download a new version of the same file from the same server and add it to cache. skip_hash_check
will also change current_version
in the cache database to point to the newly downloaded file, so when get
function is called inside myfunction
it returns the path to the newly downloaded file. At last, after myfunction
is executed, skip_hash_check
will again set current_version
to default.
@contextlib.contextmanager
def replace_file(remote_url, mirror urls, shahash,filename,timeout,show_progress=True):
"""
Parameters
----------
remote_url : str
The URL of the new file to download
mirror urls : list of str, optional
These mirror urls will be used whenever there's a problem
with the remote url.
show_progress : bool, optional
Whether to display a progress bar during the download
timeout : float, optional
The timeout, in seconds.If None, timeout will be read from
configuration settings.
shahash : str
SHA hash code with which the new downloaded file will
be compared with.
filename : str
the file name.
Raises
------
urllib2.URLError
Whenever there's a problem getting the remote file.
OSError
Whenever there's not enough space to download the file
"""
This function will allow users to download a newer version of the file from a totally different url. This function is also a context manager.
with remote_data_manager.replace_file(remote_url='http://myserver/file1.fits',
shasum='1862354754121',
file_name='stuartsfile'):
myfunction()
This will download a new version of the file from a different url, validate the downloaded file with the user provided SHA hash and add it to cache. replace_file
will change current_version
in the cache database to point to the newly downloaded file, so when get
function is called inside myfunction
, it returns the path to the newly downloaded file. At last, after myfunction
is executed, replace_file
will again change current_version
to default.
This function will also allow users to use older versions of file. If the SHA hash provided by the user is already present in the cache(older versions), replace_file
will change current_version
of the file to the user provided SHA hash. Thus when get
function is called , it returns the path of the file specified by the user.
def get_cached_urls(file_name=None,time_range=None):
"""
Get the list of URLs along with their filename in the cache.
Especially useful for looking up what files are stored in
your cache when you don't have internet access.
Parameters
----------
file_name : str , optional
If provided, function will return cached urls
for the provided file name.
time_range : TimeRange , optional
If provided ,will return the cached urls
during that time range .
Returns
-------
cached_urls : list
List of cached URLs along with their filename.
"""
def clear_download_cache(hash_or_filename=None):
"""
Clears the data file cache by deleting the local file(s).
Parameters
----------
hash_or_filename : str or None
If None, the whole cache is cleared. Otherwise, either specify a
hash for the cached file that is supposed to be deleted, or a filename that
should be removed from the cache if present.
"""
( Utility functions are greatly inspired by AstroPy's data
module.)
The functionality of each utility function is aptly explained in the string docs.
Helper functions like compute_hash
, get_free_space_in_dir
will be directly imported from AstroPy's data Module.
Now, to maintain the integrity of the data stored in cache database, cache directory must be locked before writing. For this, wrapper functions will be written around AstroPy's _acquire_download_cache_lock
,_release_download_cache_lock
. This will provide a way to lock cache directory before writing.
This is merely a modest sketch. I have tried to be as lenient as possible in assigning the weekly tasks.I will be writing developer documentation along with user documentation.The last 1-2 days of a Week will usually be reserved for code review or buffers to complete work which has not been completed in the allotted time.
Though I intend to keep in touch with the mentor throughout the week and ensure that I am going in the right direction, I have tried to maintain a feasible feedback mechanism in the schedule so as to ensure that my mentor gives feedback on a reviewable quantity of code at appropriate intervals.
Apart from below mentioned timeline, I will also write blogs weekly, so that mentors can evaluate and keep track of work that I am doing.
-
Pre GSoC Period (upto 23 April)
- Continue working on SunPy issues and open new Pull Requests on GitHub. Read information/documentation related to the project.
-
Community bonding period (23 April - 13 May)
- Read the documentation and get more familiar with the code base(especially modules which use remote data).
- Get familiar with JSON, pytest, urllib, decorators, context managers.
- Read code and get more familiar with the downloading and caching mechanism used by AstroPy's
download_file
and try to get an idea of what challenges could possibly arise while implementing the new caching mechanism. - Thoroughly discuss with mentors the changes (if any) they want in above-proposed API and get a final idea of how to approach the project.
-
Week 1 - Week 3 (May 14 -June 3)
- Discuss with the mentors and finalize the structure of cache database.
- Write wrapper function for _acquire_download_cache_lock and_release_download_cache_lock.
- Implement require function (a basic cache and download system) along with all the helper functions needed.
- Write tests and documentation for require function.
-
Week 4 (June 4 - June 10)
- Implement
get
function. - Write tests and documentation for
get
function.
- Implement
-
Week 5 (June 11 - June 17)
- Add any missing test cases and fix bugs(if any).
- Send a PR for review.
- It is not necessary that everything goes as planned out and there might be unavoidable delays due to a nasty bug hidden from eyesight.Therefore, this week is kept as a buffer.
-
Week 6 and Week 7 (June 18 - July 1)
- Discuss with mentors how the above-proposed API for overriding downloaded data can be improved to make it more robust and user-friendly.
- Address the requested changes( if any ) in the previously mentioned PR.
- Successfully implement an override mechanism, allowing users to re-download the data (
skip_hash_check
andreplace_file
).
-
Week 8 (July 2 - July 8)
- Write tests and documentation for
skip_hash_check
andreplace_file
. - Fix Bugs (if any ).
- Write tests and documentation for
-
Week 9 (July 8 - July 15)
- This week is kept as a buffer to complete any left-over in the previous periods
- Update the previously opened PR.
- Clean the code.
Phase 2 Evaluation
-
Week 10 and Week 11(July 14 -July 28)
-
Write Examples for the gallery on how to use this module. Examples will be written for o following 3 scenarios:
- Download a file from a remote server.
- Replace a file if the data changes on the remote server.
- Replace a file with file from a totally different server.
-
Update the previously opened PR.
-
Implement utility functions
-
-
Week 12 and Week 13 (July 29- August 14)
-
Add Tests and Documentation for Utility Functions
-
Write Examples for utility functions.
-
See if there exist any bugs that were not addressable before. Any pending work or issues will be addressed to during this time
-
Solve merge Conflicts
-
Complete leftover documentation(both user and developer).
-
Solve PEP8 Issues
-
Address the requested changes( if any ) in the previously opened PR.
-
Get the PR merged into the master branch.
-
Language : Python
Modules : json, urllib , pytest, contextlib.
Once implemented, remote_data_manger will provide the following benefits :
- A version control remote data management system, allowing users to download data from remote servers.
- It will provide a caching mechanism so that the user doesn't have download the data again and again.
- It will provide a override mechanism so that user can redownload the same file if the file has changed on remote server.
- It will also make sure that, if remote data is available, one version of SunPy always give the same output.
I am pretty confident of completing this project because I have fairly good experience in python. Few of my projects that I am personally fond of are :
- Automated text summarization using nltk.
- Object detection using keras framework.
- Question Answering Chatbot for AgNext. ( The code for this project is confidential ) My fascination towards open source is rather new. I started contributing to SunPy in January. Apart from SunPy , I have also contributed to SymPy . I will strictly follow my proposed timeline .I will be pushing my changes to remote fork daily and will also keep updating my mentors over my progress both through chat and blog.I will spend a lot of my time to have a good hands-on experience before the coding period starts. This will help me to avoid any problems later.Even after the GSoC coding period ends, I will be available if anybody has questions regarding my work.
I have my end semester exams from 20 April - 28 April. So, I won't be completely available during the Community Bonding Period. However, I would try my best to complete the tasks mentioned in the Community Bonding Period.
I don't have any other internship during this summer. I commit working for 40 hours a week (Mon-Sat) during GSoC period. I am also willing to work on Sundays if my mentor(s) and I feel that I am falling behind on my project work.
I might take a vacation of 4-5 days during the GSoC period. However, I will continue to work on the project during that period, although for a lesser amount of time.I have distributed my work accordingly.
Even after my new semester starts(16 July), I can easily dedicate around 5 hours during weekdays and 8 hours during weekends, as the academic load during the start of the semester is quite low.However, I have distributed work in such a way that there is less load in the last 4 weeks.
No, I have not participated in GSoC previously. This is the first time I am taking part in GSoC.
No, I am not applying to any other project.
Yes, I am eligible to receive payments from Google. For any queries, clarifications or further explanations, feel free to contact yashjainjain1704@gmail.com.