
Managing resources and data on the cluster

This is a quick guide to the computing resources we have on the cluster and what to keep in mind when designing your course materials. Each user instance has roughly 1 CPU, 10GB storage, and 2GB memory. For a more in-depth discussion of the hardware used in the course and the software used to manage it, see the technical infrastructure page.

General Overview

Our JupyterHub is organized into nodes (computers), each of which has several CPUs and a finite amount of memory. Each student also has disk space on the JupyterHub cluster. When people go to the datahub.berkeley.edu address, JupyterHub creates an instance of Python plus all class materials and reserves a CPU for them to run code. Each node has ~8 CPUs, shared by all the users currently logged in to that node.

CPU restrictions

While CPU limits how fast your code runs, it's usually not the bottleneck on the cluster. You should still limit the complexity of the analyses you ask students to run; a good rule of thumb is to simplify the analyses and datasets, then discuss how they'd be scaled up. For example, running simple machine learning on in-memory data is probably fine, while fitting a multi-layer neural network on multiple batches of data will probably take a long time.
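As a point of reference, the sketch below shows the scale of analysis that runs comfortably on a single CPU; the dataset and model are illustrative choices, not part of any course setup.

```python
# A small in-memory analysis of the kind that runs comfortably on one CPU.
# load_iris gives ~150 rows, so everything here fits easily within the limits.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression()   # a simple, fast classifier
model.fit(X_train, y_train)    # trains in well under a second
print("test accuracy:", model.score(X_test, y_test))
```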

One thing that will tie up resources is running commands that request multiple CPUs. Please avoid using parallel computing in your code (e.g., joblib, or n_jobs > 1 in scikit-learn). If you really want to do this, contact the DataHub admins, and we will try to set up something that works for you.
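Concretely, if a scikit-learn estimator exposes an n_jobs parameter, keep it at the single-CPU setting; any estimator with n_jobs behaves the same way:

```python
# Keep estimators on a single CPU: n_jobs=1 rather than n_jobs=-1,
# which would try to claim every core on the shared node.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=50, n_jobs=1)
```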

Memory restrictions

Memory management is a little tricky on the cluster. Currently each student's session gets 2GB (though this may change in future iterations). The top-right corner of each notebook shows the amount of memory currently in use out of the total available to students.
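If you want a rough check from inside a notebook as well, the standard-library resource module reports the kernel's peak memory use on Linux; this is a sketch, and the 2048 MB figure is just the session limit quoted above:

```python
# Rough check of this kernel's peak memory use against the ~2GB session limit.
# On Linux, ru_maxrss is reported in kilobytes.
import resource

used_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0
print("peak memory used by this kernel: {:.0f} MB of ~2048 MB".format(used_mb))
```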

What if I run out of memory?

  • If you run out of memory, the Python kernel will automatically restart. This can be frustrating for students because they don't get any message telling them memory was the problem. It's a good idea to let them know ahead of time.
  • If restarting the kernel and re-running the code doesn't help, there is usually another notebook open from a previous session. Go to the "running notebooks" page and close anything you're not using right now.
  • If all other notebooks are closed and you're still running out of memory, log out and log back in to your JupyterHub account. This closes your session and starts a new one. There have been reports of stray memory being held by Python even after processes have finished running; before logging out, you can also try freeing large objects by hand, as sketched below.
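Dropping references to large objects and forcing a garbage-collection pass can reclaim memory without a restart. This is a generic Python technique, not something specific to the cluster, and the array below is just a stand-in for whatever large object a notebook created:

```python
# Free memory held by a large object you no longer need.
import gc
import numpy as np

big_array = np.zeros((10000, 10000))   # ~800MB stand-in for a large object
del big_array                          # drop the last reference to it
gc.collect()                           # ask Python to reclaim the memory now
```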

Storing data and I/O

If you've got data that you want students to use, you have several options:

  1. Package the data with a GitHub repository. (Works if the data is relatively small and won't be modified.)
  2. Host the data online somewhere and have students download it. (Useful for medium-sized data, and often works well with a small helper function you write, e.g. download_dataset(url); see the sketch after this list.)
  3. Host the data on the cluster itself, in a shared directory. For this, speak with the cluster administrator to set up a read-only folder that you can direct students towards. (This is best for large datasets on the order of several GB.)
  4. Use the upload button in Jupyter to manually load data from your laptop. (This is a bit clunky, so try the other options first if possible.)
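For option 2, here is a minimal, standard-library-only sketch of the kind of download_dataset(url) helper mentioned above; the cache-directory name is an arbitrary choice, and the usage URL is hypothetical:

```python
# A minimal sketch of the download_dataset(url) helper from option 2.
import os
from urllib.request import urlretrieve

def download_dataset(url, data_dir="data"):
    """Download url into data_dir (skipping if already there); return the local path."""
    os.makedirs(data_dir, exist_ok=True)
    local_path = os.path.join(data_dir, os.path.basename(url))
    if not os.path.exists(local_path):   # avoid re-downloading on re-runs
        urlretrieve(url, local_path)
    return local_path

# Usage (hypothetical URL):
# path = download_dataset("https://example.org/course/lab1.csv")
```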