
Scraper Rewrite

Contributor Information

Name: My Le

University: Case Western Reserve University

Year: Sophomore

Email: mhl88@case.edu

Slack: mhl88@case.edu

GitHub: StupidPork

PR link(s): Updated Astro_Images.ipynb

Background

I am a sophomore at Case Western Reserve University majoring in Computer Science and Computer Engineering. I have worked on multiple research projects using Python to implement machine learning models and have experience with PyTorch and data visualization libraries such as matplotlib and seaborn. As my primary concentration is Artificial Intelligence, I also have a solid mathematical foundation in linear algebra, convexity and optimization, and analysis, as well as a background in statistics and probability. In addition, I have a passion for astronomy and astrophysics and competed in Science Olympiad astronomy events in high school.

I am also proficient in Java and am currently working on a few full-stack web and mobile development projects using HTML, CSS, JavaScript, and the MERN stack.

I hope that joining GSoC 2023 will allow me to refine my programming skills, contribute to a codebase that serves users worldwide, and enjoy the process thanks to my joint passion for coding and astronomy.

Problem Statement

The documentation for the Scraper class is currently incomplete and lacks detail, making it hard for developers to understand the API and use its methods. Its algorithms are also not optimized for large-scale data processing.

Project Implementation

Data extraction should be consistent and accurate, and the source code for the Scraper class should be optimized and cleaned up to improve readability and maintainability. In addition, the scraper must work with many different kinds of websites, so the data it encounters is extremely diverse. This leads to a few problems and proposed solutions:

+) Date formatting is an issue: different countries format dates differently, which increases data variance. To mitigate this, I will use the Python library dateutil to parse the various date and time formats and convert them into a standardized format.
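
A minimal sketch of this normalization using dateutil; the sample date strings are illustrative:

```python
from dateutil import parser

# Example date strings in several regional formats (illustrative samples).
raw_dates = ["2023-04-01T12:30:00", "01/04/2023", "April 1, 2023 12:30 PM"]

for raw in raw_dates:
    # dayfirst=True reads ambiguous dates like 01/04/2023 as 1 April.
    dt = parser.parse(raw, dayfirst=True)
    # Normalize everything to ISO 8601.
    print(dt.isoformat())
```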

+) I will optimize data retrieval using efficient techniques, such as caching and pre-fetching, to reduce the time it takes to retrieve data. This would reduce network traffic and improve performance. Processing speed should be maximized without putting too much strain on the target website's servers.

For caching, I would suggest using database caching with MongoDB, a NoSQL database. It is highly scalable and provides features like replication and sharding to support high availability and performance.
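
A minimal sketch of this caching layer, assuming a local MongoDB instance and pymongo; the database and collection names are placeholders:

```python
import requests
from pymongo import MongoClient

# Placeholder database/collection names for the cache.
client = MongoClient("mongodb://localhost:27017")
cache = client["scraper_cache"]["pages"]

def fetch_with_cache(url: str) -> str:
    """Return the page body for url, hitting the network only on a cache miss."""
    hit = cache.find_one({"_id": url})
    if hit is not None:
        return hit["body"]
    body = requests.get(url, timeout=30).text
    cache.replace_one({"_id": url}, {"_id": url, "body": body}, upsert=True)
    return body
```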

For pre-fetching, I will use asynchronous programming. This can be done with the asyncio library in Python.
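
A minimal pre-fetching sketch with asyncio, assuming aiohttp as the asynchronous HTTP client (an assumption, since the HTTP library is not fixed yet):

```python
import asyncio
import aiohttp  # assumed choice of async HTTP client

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as response:
        return await response.text()

async def prefetch(urls: list[str]) -> list[str]:
    # Issue all requests concurrently instead of one at a time.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

# pages = asyncio.run(prefetch(["https://example.com/a", "https://example.com/b"]))
```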

+) The class should have good error-handling capabilities and, where possible, continue the scraping process instead of crashing or stopping at the first error, to ensure robustness. This could include catching and logging exceptions and providing better error messages to the user.
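
A minimal sketch of this pattern using the standard logging module; the scrape_all name is illustrative:

```python
import logging

import requests

logger = logging.getLogger("scraper")

def scrape_all(urls):
    """Scrape every URL, logging failures instead of aborting the whole run."""
    results = {}
    for url in urls:
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            results[url] = response.text
        except requests.RequestException as exc:
            # Log and move on so one bad URL does not stop the batch.
            logger.warning("Failed to scrape %s: %s", url, exc)
    return results
```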

+) Due to the large volume of data, the algorithms used by the Scraper class should be optimized for big data processing.

I would use distributed processing for this project, because the scraper is network-bound (limited by network bandwidth and latency) and requires a lot of memory due to the volume of data. In distributed processing, each node has its own memory space, satisfying the memory requirements. Distributed processing is also suitable for large-scale scraping and handles errors gracefully, ensuring robustness.

I propose using Apache Spark for this task, as it can handle large-scale data processing, supports multiple programming languages including Python, and provides many tools for distributed processing, machine learning, and graph processing. It also has a large user base, so community support is readily available.
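
A minimal PySpark sketch of this idea; the application name and URLs are placeholders:

```python
import requests
from pyspark.sql import SparkSession

def fetch(url: str) -> str:
    return requests.get(url, timeout=30).text

# Distribute a list of URLs across the cluster and fetch them in parallel.
spark = SparkSession.builder.appName("scraper-demo").getOrCreate()
urls = ["https://example.com/a", "https://example.com/b"]
pages = spark.sparkContext.parallelize(urls).map(fetch).collect()
spark.stop()
```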

If multiple machines are not available, I would suggest multithreading because scraping is an I/O-bound task that does not require significant CPU processing. I would use the threading module in Python to create threads to run tasks in parallel, using the Thread class to create new threads and the start method to start them.
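
A minimal sketch of that fallback with the threading module; the URLs are placeholders:

```python
import threading

import requests

results = {}

def scrape(url: str) -> None:
    # Each thread writes its own key, so the shared dict stays consistent.
    results[url] = requests.get(url, timeout=30).text

urls = ["https://example.com/a", "https://example.com/b"]
# One Thread per URL, started with start() and joined before reading results.
threads = [threading.Thread(target=scrape, args=(url,)) for url in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
```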

+) Proxy support should be added. The scraper should be able to use proxies to bypass restrictions or limitations on the target website. I would use a proxy rotation service, like Scraper API, to ensure the proxies are not blocked or blacklisted by the target website. This would improve the reliability of the scraper.
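
A sketch of proxy support with requests; the proxy address and credentials are placeholders for whatever the rotation service provides:

```python
import requests

# Placeholder proxy gateway; with a rotation service such as Scraper API,
# each request routed through the gateway exits from a different IP.
proxies = {
    "http": "http://user:password@proxy.example.com:8080",
    "https": "http://user:password@proxy.example.com:8080",
}

response = requests.get("https://example.com/data", proxies=proxies, timeout=30)
```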

+) I would also improve customizability. There should be methods that let users customize the scraping process, such as specifying the data to be extracted, how often extraction should run, and how the data should be formatted. In addition, the scraper should be tunable for the target website to improve its accuracy and efficiency, including settings such as the scraping frequency and the data extraction method.
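
As a sketch, the user-facing options could be gathered into a configuration object; all field names here are illustrative and not part of the existing Scraper API:

```python
from dataclasses import dataclass, field

@dataclass
class ScraperConfig:
    # Hypothetical options covering the customizations described above.
    fields: list[str] = field(default_factory=lambda: ["date", "title"])
    frequency_minutes: int = 60          # how often extraction should run
    output_format: str = "json"          # how the data should be formatted
    extraction_method: str = "css"       # e.g. "css", "xpath", or "regex"

config = ScraperConfig(fields=["date", "url"], frequency_minutes=30)
```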

+) New features should be added to the scraper to make it more versatile and useful. For example, I will add a feature to automatically extract data from a list of websites, and to automatically schedule scraping jobs.
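
A minimal scheduling sketch using the standard library's sched module; the URLs, interval, and scrape_site function are placeholders:

```python
import sched
import time

def scrape_site(url: str) -> None:
    print(f"scraping {url}")  # placeholder for the real extraction logic

def scrape_all(scheduler: sched.scheduler, urls: list[str], interval: int) -> None:
    for url in urls:
        scrape_site(url)
    # Re-schedule this job so scraping repeats automatically.
    scheduler.enter(interval, 1, scrape_all, (scheduler, urls, interval))

urls = ["https://example.com/a", "https://example.com/b"]
scheduler = sched.scheduler(time.time, time.sleep)
scheduler.enter(0, 1, scrape_all, (scheduler, urls, 3600))
# scheduler.run()  # blocks, running the job every hour
```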

+) I would ensure the scraper is kept up to date with the latest developments in web scraping and data processing, using the most up-to-date and effective libraries and frameworks.

+) The documentation for the Scraper class could be refined by providing clear examples and use cases. This would help users to understand how to use the Scraper class and how it works, as well as what its limitations are.

+) I would also add support for authentication (OAuth).
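
A sketch of the OAuth 2.0 client-credentials flow using requests-oauthlib (an assumed choice of library); the client ID, secret, and token URL are placeholders:

```python
from oauthlib.oauth2 import BackendApplicationClient
from requests_oauthlib import OAuth2Session

# Placeholder credentials and endpoint.
CLIENT_ID = "my-client-id"
CLIENT_SECRET = "my-client-secret"
TOKEN_URL = "https://auth.example.com/oauth/token"

client = BackendApplicationClient(client_id=CLIENT_ID)
session = OAuth2Session(client=client)
session.fetch_token(token_url=TOKEN_URL, client_id=CLIENT_ID,
                    client_secret=CLIENT_SECRET)

# Subsequent requests carry the bearer token automatically.
response = session.get("https://example.com/protected-data")
```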

+) I would add support for data filtering. This would allow users to filter the data that is retrieved from the data source, reducing the amount of data that needs to be processed.
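
A minimal sketch of user-supplied filtering; the record fields are hypothetical:

```python
from datetime import datetime

# Hypothetical records as the scraper might return them.
records = [
    {"date": datetime(2023, 1, 5), "instrument": "AIA"},
    {"date": datetime(2023, 3, 9), "instrument": "HMI"},
]

def filter_records(records, predicate):
    """Keep only the records the user's predicate accepts."""
    return [r for r in records if predicate(r)]

aia_only = filter_records(records, lambda r: r["instrument"] == "AIA")
```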

+) I will ensure the privacy and security of the target website and its users, complying with current privacy laws and regulations. The scraper should not cause any harm or damage to the website, its servers, or its users, ensuring the best user experience possible.

Why I chose this project

As mentioned above, I have a strong passion for astronomy and would love to integrate my programming and astronomy interests through this project. Rewriting an API is also a challenging task that requires a deep understanding of the API's functionality and the requirements of the new implementation, which lets me hone my problem-solving skills and push myself. Coming up with new ways to improve the Scraper API gives me a chance to exercise my creativity and devise innovative solutions, and I believe my understanding of code architecture and best practices will also be significantly enhanced. Rewriting an API is also a great way to learn new skills and technologies as I explore new ways of implementing the same functionality. Lastly, rewriting an API can have a significant impact on the product or service that uses it, and I am excited to make a meaningful contribution by working on an open-source project that serves many users worldwide.

Why I think I am a good fit for this project

With my background in Python programming for artificial intelligence and machine learning, I believe I possess the necessary technical skills and knowledge for this project. Additionally, my familiarity with the SunPy API would allow me to start contributing right away. I am also familiar with Git version control and have used it for multiple projects both in and outside of class. However, I have not contributed to open source before; my pull request to the SunPy repository was my first open-source experience.

What other people have done on this idea

The API has been revised many times in the past, with modifications and suggestions such as using parse instead of regex to improve the scraper's versatility, redesigning Scraper due to persistent bugs and issues (for instance, converting Astropy Time objects to normal calendar dates), extracting code to a higher level, and implementing parallel queries.

Timeline

Community Bonding Period (May 4 - May 28)

Get to know the mentor and other members of the OpenAstronomy community

Get familiarized with the existing codebase and project documentation and tools used in the project

Set up a development environment

Discuss the project requirements, goals, and API design with the mentor

Week 1 - 2 (May 29 - June 12)

Get used to the functionalities of the current Scraper

Come up with and implement a general outline of what the new Scraper class should look like, refining the user interface if needed

Week 3 - 4 (June 13 - June 26)

Figure out whether we can use parse instead of Python's regex.

Implement functionalities described above if suitable.

Week 5 - 6 (June 27 - July 10)

Finish implementing the functionality described above and complete a functional scraper.

Testing and debugging.

First Evaluation (July 11, 2023)

Partial skeleton of scraper written

Week 7 - 8 (July 14 - July 24)

Clean up and refactor code as necessary, improve readability and maintainability.

Add any missing functionality, if applicable.

Maximize performance.

Week 9 - 10 (July 25 - August 21)

Continue improving performance metrics.

Write documentation and prepare for the final evaluation.

Finalize last-minute changes.

Submit code for review and feedback from mentors.

Final Week (August 21 - August 28)

Functional replacement ready for review and merging into SunPy.

Make final code improvements based on feedback from mentors.

Submit final code and documentation for review and evaluation.

GSoC Questions

Have you participated previously in GSoC? When? With which project?

I haven't.

Are you also applying to other projects?

I'm not.

Are you eligible to receive payments from Google?

I am.

How much time do you plan to invest in the project before, during, and after the Summer of Code?

I can work on the project for around 30 hours per week. I am doing undergraduate research at my school over the summer, but other than that, I am not taking classes and have no other commitments.
