Introduction

I was asked to hack together something that processes a category's listing pages, up to the nth page, to help with exam preparation.

An offline copy of the articles is useful as reading material.

Dependency

The project depends on the modules listed in requirements.txt and requires Python 3.6+. Install the dependencies with pip:

pip install -r requirements.txt

Usage

Running the executable main.py with the -h flag prints the usage:

$ ./main.py -h
usage: main.py [-h] [--start START] category page_num

Parse and save as PDF range of pages from gktoday.in for category

positional arguments:
  category              Provide category for the website such as `environment-current-affairs`
  page_num              Page until which to parse

optional arguments:
  -h, --help            show this help message and exit
  --start START, -s START
                        Start parsing from this page

This is enough to begin putting together the pages. The generated PDF files are placed in the following directory structure:

articles/<category>/<month-name>/<article-title>
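For reference, the command-line interface shown in the help output above could be declared with argparse roughly as follows. This is a minimal sketch rather than the actual contents of main.py: the function names build_parser and pages_to_fetch are hypothetical, and both the default start page of 1 and the inclusive treatment of page_num are assumptions.

```python
import argparse


def build_parser():
    """Build a parser matching the usage text shown above (a sketch)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Parse and save as PDF range of pages from gktoday.in for category",
    )
    parser.add_argument(
        "category",
        help="Provide category for the website such as `environment-current-affairs`",
    )
    parser.add_argument("page_num", type=int, help="Page until which to parse")
    # Assumption: parsing starts from page 1 when --start is not given.
    parser.add_argument(
        "--start", "-s", type=int, default=1, help="Start parsing from this page"
    )
    return parser


def pages_to_fetch(args):
    """Return the page numbers to process.

    Assumption: page_num is inclusive ("page until which to parse").
    """
    return range(args.start, args.page_num + 1)


# Example: pages 3 to 5 of the environment-current-affairs category.
args = build_parser().parse_args(["environment-current-affairs", "5", "--start", "3"])
print(list(pages_to_fetch(args)))  # [3, 4, 5]
```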

Combining all files into a single file

While I could do this in Python itself, I thought it better to have the program create a separate PDF for each article. You can generate the combined files with pdfunite, a command-line utility from the well-known poppler-utils package:

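cd articles
for i in */
do
  i="${i%/}"
  cd "$i"
  # Merge each month's articles into <category>/<month-name>.pdf
  for mo in */
  do
    mo="${mo%/}"
    cd "$mo"
    pdfunite *.pdf "../${mo}.pdf"
    cd ..
  done
  # Merge the per-month files into <category>/<category>.pdf
  pdfunite *.pdf "${i}.pdf"
  cd ..
done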
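The same combining step could also be driven from Python by calling pdfunite through subprocess. The sketch below is not part of the project: unite_commands and combine are hypothetical names, and it assumes pdfunite is on the PATH and that the directory layout is articles/<category>/<month-name>/ with PDF files inside each month directory. Unlike the shell loop, it merges only the per-month files it has just planned, so a leftover category PDF from an earlier run is not merged again.

```python
import subprocess
from pathlib import Path


def unite_commands(articles_dir):
    """Return the pdfunite invocations, in order, as argument lists."""
    root = Path(articles_dir)
    commands = []
    for category in sorted(p for p in root.iterdir() if p.is_dir()):
        month_pdfs = []
        for month in sorted(p for p in category.iterdir() if p.is_dir()):
            articles = sorted(month.glob("*.pdf"))
            if not articles:
                continue  # nothing to merge for this month
            # One merged file per month: <category>/<month-name>.pdf
            month_pdf = category / f"{month.name}.pdf"
            commands.append(["pdfunite", *map(str, articles), str(month_pdf)])
            month_pdfs.append(month_pdf)
        if month_pdfs:
            # One merged file per category: <category>/<category>.pdf
            category_pdf = category / f"{category.name}.pdf"
            commands.append(["pdfunite", *map(str, month_pdfs), str(category_pdf)])
    return commands


def combine(articles_dir="articles"):
    """Run the planned pdfunite commands, stopping at the first failure."""
    for command in unite_commands(articles_dir):
        subprocess.run(command, check=True)
```

Calling combine() from the directory that contains articles runs the merges. Months are merged in alphabetical order of their directory names, as with the shell glob.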

Credits

weasyprint
requests-html
poppler-utils
A Stack Overflow post for suggesting weasyprint
Another Stack Overflow thread for pdfunite
GKToday.in for the content
