
GSoC 2023 SN Pradeep

Nabil Freij edited this page Feb 22, 2024 · 7 revisions

[SunPy] Refactoring and Improving the Maintainability of Sunpy's Scraper Module

Personal Details and Contact Information

  • Name : SN Pradeep
  • Email : pradeepsn606@gmail.com
  • Github : @gmrpr321
  • College : Velammal College of Engineering and Technology, Madurai, Tamil Nadu, India.
  • Element : @pradeepsn:matrix.org
  • Timezone : Indian Standard Time (UTC +05:30)

I am SN Pradeep, currently pursuing a B.E. in Computer Science. I am passionate about exploring the exciting intersection of computer science and engineering to solve complex problems and create innovative solutions. I am also interested in astronomy and physics, and I am excited to work on an open-source project that combines these interests with my passion for computer science and engineering.

Synopsis

The Scraper class implements the web-scraping functionality within SunPy and is used by some of the simpler internal clients to scrape web pages for data files and metadata within a specified time range (sunpy.time.timerange.TimeRange). The current implementation of the Scraper class relies heavily on Python regular expressions for formatting URLs, constructing file paths, and extracting datetime information. This reduces the overall readability and maintainability of the scraper module.

The goal is to construct a new scraper module that retains the functionality of the pre-existing module while improving its readability and maintainability.

Refinements

Several areas of the code can be improved by using parse instead of regex. Here are some methods in sunpy.net.scraper.Scraper that would benefit:

  1. In the __init__ method, the pattern parameter is already formatted using kwargs. Instead of using regex to extract datetime formats from the pattern, parse can be used to extract the named placeholders directly.
  2. In the _extractDateURL method, regex is used to extract the date from a URL that follows the pattern. This can be simplified by using parse to extract the named placeholders from the pattern and then parsing the date with the corresponding format.
  3. In the _URL_followsPattern method, TIME_CONVERSIONS (a dictionary mapping time-format codes to regular expressions) is used to replace parts of the pattern in order to check whether a given URL follows it. This hurts readability; a parser can be used to simplify things.

The current implementation that replaces the time-format codes:

```python
TIME_CONVERSIONS = {'%Y': r'\d{4}', '%y': r'\d{2}',
                    '%b': '[A-Z][a-z]{2}', '%B': r'\W', '%m': r'\d{2}',
                    '%d': r'\d{2}', '%j': r'\d{3}',
                    '%H': r'\d{2}', '%I': r'\d{2}', '%M': r'\d{2}',
                    '%S': r'\d{2}', '%e': r'\d{3}', '%f': r'\d{6}'}

pattern = self.pattern
for k, v in TIME_CONVERSIONS.items():
    pattern = pattern.replace(k, v)
matches = re.match(pattern, url)
```
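For reference, here is the current regex-based check as a standalone sketch (simplified to the date-related codes actually used by the example pattern; this is illustrative, not the actual sunpy code):

```python
import re

# Simplified standalone sketch of the existing regex-based pattern check.
TIME_CONVERSIONS = {'%Y': r'\d{4}', '%m': r'\d{2}', '%d': r'\d{2}',
                    '%H': r'\d{2}', '%M': r'\d{2}', '%S': r'\d{2}'}

pattern = 'http://proba2.oma.be/swap/data/bsd/%Y/%m/%d/swap_lv1_%Y%m%d_%H%M%S.fits'
for code, regex in TIME_CONVERSIONS.items():
    pattern = pattern.replace(code, regex)

url = 'http://proba2.oma.be/swap/data/bsd/2004/12/24/swap_lv1_20041224_120743.fits'
follows = re.match(pattern, url) is not None  # True when the URL follows the pattern
```

Note that after the substitutions the pattern is no longer human-readable, which is exactly the maintainability problem described above.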

This can be rewritten without using regex; here is an example:

```python
import datetime

input_pattern = 'http://proba2.oma.be/swap/data/bsd/%Y/%m/%d/swap_lv1_%Y%m%d_%H%M%S.fits'
input_url = 'http://proba2.oma.be/swap/data/bsd/2000/04/10/swap_lv1_20041224_120743.fits'
supported_date_formats = [
    '%Y/%m/%d',
    '%Y%m%d_%H%M%S',
    '%Y-%m-%d',
    '%Y-%m-%dT%H:%M:%S.%f',
    '%Y%m%dT%H%M%S.%f',
    # more formats can be added if needed
]

pattern_lst = []
input_lst = []
start_index = 0
end_index = 0

# Split the pattern into chunks, each ending with a supported date format
for format_date in supported_date_formats:
    if format_date in input_pattern:
        end_index = input_pattern.find(format_date) + len(format_date)
        pattern_lst.append(input_pattern[start_index:end_index])
        start_index = end_index

# Split the input URL into the corresponding chunks
start_index = 0
end_index = 0
for pattern in pattern_lst:
    end_index = start_index + len(pattern)
    if '%Y' in pattern:
        end_index += 2  # %Y expands to four digits, two more than len('%Y')
    input_lst.append(input_url[start_index:end_index])
    start_index = end_index

# For each chunk, check whether the URL part parses with the pattern part
for x in range(len(pattern_lst)):
    try:
        datetime_obj = datetime.datetime.strptime(input_lst[x], pattern_lst[x])
    except ValueError:
        print("Input URL doesn't match")
    else:
        print('Input URL matches')
```
  4. In the _localfilelist method, regex is used to return a list of local file paths that match a certain pattern. This can be simplified by using parse to extract the named placeholders from the pattern and generating the list of file paths by substituting datetime objects into the placeholders.
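To sketch what generating file paths by datetime substitution could look like, here is a minimal, hypothetical example (the function name and pattern are illustrative, not sunpy API) that produces candidate paths from a strftime-style pattern with no regex involved:

```python
from datetime import datetime, timedelta

def candidate_paths(pattern, start, end, step=timedelta(days=1)):
    """Hypothetical helper: one candidate path per step between start and end."""
    paths = []
    current = start
    while current <= end:
        # strftime substitutes the datetime into the %-style placeholders
        paths.append(current.strftime(pattern))
        current += step
    return paths

paths = candidate_paths('/data/%Y/%m/swap_lv1_%Y%m%d.fits',
                        datetime(2004, 12, 24), datetime(2004, 12, 26))
# paths[0] == '/data/2004/12/swap_lv1_20041224.fits', three entries in total
```

The generated candidates can then be checked against the filesystem directly, instead of matching existing file names with a regex.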

These are just a few examples; overall, the readability and maintainability of the scraper module could be improved with the proposed implementation.
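As a concrete illustration of the named-placeholder idea from point 1, the standard library's string.Formatter can already enumerate the placeholders in a kwargs-style pattern (a minimal sketch; the pattern below is illustrative, not taken from sunpy):

```python
from string import Formatter

# Illustrative kwargs-style pattern with named placeholders
pattern = '{instrument}/data/{year:4d}/{month:02d}/file_{year:4d}{month:02d}{day:02d}.fits'

# Formatter().parse yields (literal_text, field_name, format_spec, conversion)
fields = [name for _, name, _, _ in Formatter().parse(pattern) if name]
# fields -> ['instrument', 'year', 'month', 'year', 'month', 'day']
```

The parse library builds on the same placeholder syntax, so the extracted field names map directly onto the datetime components that the scraper needs.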

Deliverables

Expected by Evaluation 1 :

  • Write utility functions to extract datetimes, calculate time ranges, and match given file paths against a specific date and time.
  • Develop a partial Scraper and provide documentation for the classes created thus far.
  • Create tests for all classes created thus far.
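A minimal sketch of what one such utility might look like (the name and signature are hypothetical, not a committed design):

```python
from datetime import datetime

def path_matches_datetime(pattern, path, when):
    """Hypothetical helper: does `path` equal `pattern` rendered at `when`?"""
    return path == when.strftime(pattern)

ok = path_matches_datetime('/data/%Y/%m/%d.fits',
                           '/data/2004/12/24.fits',
                           datetime(2004, 12, 24))
# ok is True
```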

Expected by Evaluation 2 :

  • A fully functional Scraper supporting the HTTP and FTP protocols.
  • Adequate examples and documentation for the entire refactored module.
  • Performance improvements and accompanying benchmarks.
  • Integration of the updated Scraper class into the sunpy codebase.

Timeline

Community Bonding Period (May 4 - May 28)

  • Approach the mentors to understand the structure of the current scraper
  • Read Documentation / Get familiar with sunpy library's codebase

Week 1-2 (May 29 – June 12)

  • Implement parse-based URL pattern conversion
  • Modify the Scraper class to use parse instead of regex for URL pattern conversion
  • Update the class documentation to reflect the changes made
  • Test the modified Scraper class using existing test cases and fix any issues that arise

Week 3-4 (June 13 - June 26)

  • Refactor the matches method
  • Update the method documentation to reflect the changes made
  • Test the modified method using existing test cases and fix any issues that arise

Week 5-6 (June 27 - July 10)

  • Refactor the _URL_followsPattern and _extractDateURL methods to use parse instead of regex for URL pattern matching and date extraction
  • Update the method documentation to reflect the changes made
  • Test the modified methods using existing test cases and fix any issues that arise

Week 7-8 (July 11 - July 24)

  • Refactor the filelist method to use parse instead of regex for URL pattern matching and date extraction
  • Update the method documentation to reflect the changes made
  • Test the modified method using existing test cases and fix any issues that arise

Week 9-10 (July 25 – August 7)

  • Refactor the _ftpfileslist and _localfilelist and extract_files_meta methods to use parse instead of regex for URL pattern matching and date extraction
  • Update the method documentation to reflect the changes made
  • Test the modified methods using existing test cases and fix any issues that arise

Week 11-12 (August 8 – August 21)

  • Finalize the documentation for the modified Scraper class, including any changes made to the method documentation
  • Test the entire modified Scraper class using existing test cases and fix any remaining issues
  • Work on performance improvements

Final Week (August 21 – August 28)

  • Make final code and documentation improvements
  • Submit final code and documentation for review and evaluation

Contributions

#6895 Marks sunpy.io readers Private

Me and GSoC

  • Have you participated previously in GSoC? When? With which project? No
  • Are you also applying to other projects? No

Schedule availability

  • Due to final exams, I will not be available during May 15 – 31.
  • Due to practical exams (May 1 – 6), I can only spend a limited amount of time (2-3 hrs).
  • On other days, I can work for 5-6 hrs.