
## Introduction

I was asked whether I could hack together something to process the category pages up to the nth page, to help with their exam preparation.

An offline copy of the articles makes for useful reading material.

## Dependencies

The project depends on the modules listed in the `requirements.txt` file and requires Python 3.6+. Install the dependencies with pip:

```sh
pip install -r requirements.txt
```

## Usage

Passing `-h` to the executable `main.py` prints the usage:

```
$ ./main.py -h
usage: main.py [-h] [--start START] category page_num

Parse and save as PDF range of pages from gktoday.in for category

positional arguments:
  category              Provide category for the website such as `environment-current-affairs`
  page_num              Page until which to parse

optional arguments:
  -h, --help            show this help message and exit
  --start START, -s START
                        Start parsing from this page
```

which is enough to start putting the pages together. The generated PDF files are placed in the following directory structure:

```
articles/<category>/<month-name>/<article-title>
```
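
For example, to parse pages 2 through 5 of a category (the category slug and page numbers here are only illustrative, taken from the help text above):

```sh
./main.py environment-current-affairs 5 --start 2
```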

## Combining all files into a single file

While I could have done this in Python itself, I felt it was better for the program to create a separate PDF for each article. Using `pdfunite`, a command-line utility from the well-known poppler-utils package, you can generate the combined files:

```sh
cd articles
# For each category, merge every month's article PDFs into one PDF per
# month, then merge the month PDFs into a single PDF for the category.
for i in *
do
  cd "$i"
  for mo in *
  do
    cd "$mo"
    pdfunite *.pdf "../$mo.pdf"
    cd ..
  done
  pdfunite *.pdf "${i}.pdf"
  cd ..
done
```
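
If you would rather stay in Python after all, here is a minimal sketch of the same merge using `pypdf`, which is not among this project's dependencies; the input and output paths mirror the shell loop above:

```python
# merge_pdfs.py -- hypothetical alternative to the shell loop above,
# using pypdf (not a dependency of this project).
from pathlib import Path
from pypdf import PdfWriter

def merge_dir(directory: Path, output: Path) -> None:
    """Concatenate every PDF in `directory` (sorted by name) into `output`."""
    writer = PdfWriter()
    for pdf in sorted(directory.glob("*.pdf")):
        writer.append(str(pdf))
    with output.open("wb") as fh:
        writer.write(fh)

articles = Path("articles")
for category in articles.iterdir():
    if not category.is_dir():
        continue
    # One PDF per month, written into the category directory.
    for month in category.iterdir():
        if month.is_dir():
            merge_dir(month, category / f"{month.name}.pdf")
    # Then one PDF per category, from the month PDFs just created.
    merge_dir(category, category / f"{category.name}.pdf")
```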

## Credits

1. weasyprint
2. requests-html
3. poppler-utils
4. This Stack Overflow post for suggesting weasyprint
5. Another SO thread for pdfunite
6. GKToday.in for the content