laszewsk/deprecated-osmi

mlcommons-osmi

deprecated: please see https://github.com/laszewsk/osmi-bench for a new version

1. Running OSMI Bench on Ubuntu natively

1.1 Create python virtual environment on Ubuntu

Note:

  • tensorflow: python 3.10 is the latest supported version

  • smartredis: python 3.10 is the latest supported version

  • Hence, we will use python3.10

First create a venv with

ubuntu> 
  python3.10 -m venv ~/OSMI
  source ~/OSMI/bin/activate
  pip install pip -U
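
Before installing the requirements, it can help to confirm that the shell really uses the new venv and a Python no newer than 3.10. The following is a minimal sketch and not part of OSMI; the helper names in_virtualenv and version_supported are hypothetical.

```python
import sys

# From the note above: 3.10 is the latest supported Python version.
MAX_SUPPORTED = (3, 10)


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv."""
    # Inside a venv, sys.prefix points at the venv, while sys.base_prefix
    # points at the interpreter the venv was created from.
    return sys.prefix != sys.base_prefix


def version_supported(version=None) -> bool:
    """Return True when (major, minor) is not newer than MAX_SUPPORTED."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version[:2]) <= MAX_SUPPORTED


if __name__ == "__main__":
    print("inside venv:", in_virtualenv())
    print("python version ok:", version_supported())
```

Run it with the python of the activated venv; both lines should report True before you continue.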

1.2 Get the code

Change to the directory where you want to install osmi. We assume it does not already contain a directory called osmi; use ls osmi to check. Next we set up the osmi directory and clone the code into it from this GitHub repository:

https://github.com/laszewsk/osmi.git

Please execute:

ubuntu>
  mkdir ./osmi
  export OSMI_HOME=$(realpath "./osmi")
  export OSMI=${OSMI_HOME}/
  git clone https://github.com/laszewsk/osmi.git
  cd osmi
  pip install -r target/ubuntu/requirements-ubuntu.txt

1.3 Running the small OSMI model benchmark

cd models
time python train.py small_lstm  # ~   6.6s on a 5950X with RTX3090
time python train.py medium_cnn  # ~  35.6s on a 5950X with RTX3090
time python train.py large_tcnn  # ~ 16m58s on a 5950X with RTX3090
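
To collect the three timings in one go instead of reading them off the shell's time output, a small wrapper can be used. This is a minimal sketch, assuming it is started from the models directory; the helpers time_command and train_commands are hypothetical and not part of OSMI.

```python
import subprocess
import sys
import time

# Model names as used in the commands above.
MODELS = ["small_lstm", "medium_cnn", "large_tcnn"]


def time_command(cmd):
    """Run cmd (a list of strings); return (return code, elapsed seconds)."""
    start = time.perf_counter()
    result = subprocess.run(cmd, check=False)
    return result.returncode, time.perf_counter() - start


def train_commands(python=sys.executable):
    """Build one 'python train.py <model>' command per model."""
    return [[python, "train.py", model] for model in MODELS]
```

Calling time_command(cmd) for each entry of train_commands() gives wall-clock times comparable to the "total" reported by time; it does not separate user and system time.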

1.4 TODO: Install tensorflow serving on Ubuntu

This part of the documentation is unclear and not tested.

Open question: the existing documentation does this with singularity. Singularity is available on the desktop, but can we also run tensorflow serving natively and compare its performance with singularity?

echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | sudo apt-key add -
sudo apt-get update && sudo apt-get install tensorflow-model-server
which tensorflow_model_server
make image

2. Running on UVA Rivanna

2.1 Logging into Rivanna

The easiest way to log into rivanna is to use ssh. However, as we are creating singularity images, we currently need to use either bihead1 or bihead2.

To set this up, please follow the documentation at

http://sciencefmhub.org/docs/tutorials/rivanna/singularity/

It is best to also install cloudmesh-rivanna and cloudmesh-vpn on your local machine, as this simplifies login and management of the machine.

local>
  python -m venv ~/ENV3
  source ~/ENV3/bin/activate  # on Windows gitbash: source ~/ENV3/Scripts/activate
  pip install cloudmesh-rivanna
  pip install cloudmesh-vpn

If you have set up the vpn client correctly, you can now activate it from the terminal, including gitbash on Windows. If this does not work, you can alternatively use the Cisco VPN GUI client and ssh to one of the biheads.

If you followed our documentation, you will be able to say

local>
  cms vpn activate
  ssh b1

Furthermore, we assume that you have also checked out the code on your laptop, as we use this checkout later to sync the results created on the supercomputer.

local>
  mkdir ~/github
  cd ~/github
  git clone https://github.com/laszewsk/osmi.git
  cd osmi

To refer to the code on rivanna from your local machine with the same environment variables that are used on rivanna itself, we define

local>
  export USER_SCRATCH=/scratch/$USER
  export USER_LOCALSCRATCH=/localscratch/$USER
  export BASE=$USER_SCRATCH
  export CLOUDMESH_CONFIG_DIR=$BASE/.cloudmesh
  export PROJECT=$BASE/osmi
  export EXEC_DIR=$PROJECT/target/rivanna

These variables will come in handy when we rsync the results. You should now be logged in on a frontend node of rivanna.
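
How the exported variables build on each other can be sketched in Python. This is only an illustration of the export block above; the function rivanna_paths is hypothetical and not part of OSMI or cloudmesh.

```python
import getpass
from pathlib import PurePosixPath


def rivanna_paths(user=None):
    """Return the variables from the export block above as a dict."""
    # Defaults to the local user name, like $USER in the shell.
    user = user or getpass.getuser()
    user_scratch = PurePosixPath("/scratch") / user
    base = user_scratch              # BASE=$USER_SCRATCH
    project = base / "osmi"          # PROJECT=$BASE/osmi
    return {
        "USER_SCRATCH": str(user_scratch),
        "USER_LOCALSCRATCH": str(PurePosixPath("/localscratch") / user),
        "BASE": str(base),
        "CLOUDMESH_CONFIG_DIR": str(base / ".cloudmesh"),
        "PROJECT": str(project),
        "EXEC_DIR": str(project / "target" / "rivanna"),
    }
```

Note that on your local machine $USER is your local user name. If it differs from your rivanna user id, the exported paths will not match the ones on rivanna, so pass the rivanna id explicitly, for example rivanna_paths("abc1de") with your own id in place of this placeholder.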

2.2 Running OSMI Bench on rivanna

To run the OSMI benchmark, you first need to generate the project directory with the code. We assume you are in the group bii_dsc_community, so that you can create singularity images on rivanna, and that you have access to the slurm partitions gpu and bii_gpu.

TODO: other required groups are missing here and need to be copied from our documentation.

We will set up OSMI in the /scratch/$USER directory.

2.3 Set up a project directory and get the code

First you need to create the directory. The following steps simplify this and make the installation uniform.

b1>
  export USER_SCRATCH=/scratch/$USER
  export USER_LOCALSCRATCH=/localscratch/$USER
  export BASE=$USER_SCRATCH
  export CLOUDMESH_CONFIG_DIR=$BASE/.cloudmesh
  export PROJECT=$BASE/osmi
  export EXEC_DIR=$PROJECT/target/rivanna

  mkdir -p $BASE
  cd $BASE
  git clone https://github.com/laszewsk/osmi.git
  cd osmi

You now have the code in $PROJECT

2.4 Set up Python Environment

Note: This is no longer working.

OSMI runs in batch mode. This also applies to setting up the environment, for which we created an sbatch script. This has the advantage that the environment is installed via the worker nodes, which is typically faster, and it guarantees that the worker node itself is used for the installation, which avoids software incompatibilities.

b1>
 cd $EXEC_DIR
 sbatch environment.slurm
 # (this may take a while)
 source $BASE/ENV3/bin/activate

See: environment.slurm

Note: currently we recommend the following alternative, which runs the commands directly:

b1>
  cd $EXEC_DIR
  module load gcc/11.4.0  openmpi/4.1.4 python/3.11.4
  which python
  python --version
  python -m venv $BASE/ENV3 # takes about 5.2s
  source $BASE/ENV3/bin/activate
  pip install pip -U
  time pip install -r $EXEC_DIR/requirements.txt # takes about 1m21s
  cms help

2.5 Build Tensorflow Serving, Haproxy, and OSMI Images

We created convenient singularity images for tensorflow serving, haproxy, and the code to be executed. This is done with

b1>
  cd $EXEC_DIR
  make images

2.6 Compile OSMI Models in Batch Jobs

To check that things work, you can run some test jobs that each train a model with the commands

b1>
  cd $EXEC_DIR
  sbatch train-small.slurm  #    26.8s on a100_80GB, bi_fox_dgx
  sbatch train-medium.slurm #    33.5s on a100_80GB, bi_fox_dgx
  sbatch train-large.slurm  # 1m  8.3s on a100_80GB, bi_fox_dgx

Note: Gregor has worked through the instructions up to this point.

Run benchmark with cloudmesh experiment executor

Set parameters in config.in.slurm

experiment:
  # different gpus require different directives
  directive: "a100,v100"
  # batch size
  batch: "1,2,4,8,16,32,64,128"
  # number of gpus
  ngpus: "1,2,3,4"
  # number of concurrent clients
  concurrency: "1,2,4,8,16"
  # models
  model: "small_lstm,medium_cnn,large_tcnn"
  # number of repetitions of each experiment
  repeat: "1,2,3,4"
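
Each entry in the experiment block is a comma-separated list of values. The sketch below estimates how many jobs this configuration describes, assuming that one job is generated per combination of values. That is an assumption about the cloudmesh experiment executor, so check its documentation for the actual behavior. The function expand is hypothetical.

```python
from itertools import product

# Values copied from the experiment block above.
EXPERIMENT = {
    "directive": "a100,v100",
    "batch": "1,2,4,8,16,32,64,128",
    "ngpus": "1,2,3,4",
    "concurrency": "1,2,4,8,16",
    "model": "small_lstm,medium_cnn,large_tcnn",
    "repeat": "1,2,3,4",
}


def expand(experiment):
    """Return one dict per combination of the comma-separated values."""
    keys = list(experiment)
    choices = [experiment[key].split(",") for key in keys]
    return [dict(zip(keys, combo)) for combo in product(*choices)]


jobs = expand(EXPERIMENT)
print(len(jobs))  # 2 * 8 * 4 * 5 * 3 * 4 = 3840
print(jobs[0])
```

Under this assumption the configuration above describes 3840 jobs, so it is worth trimming the lists while testing.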

To run the many different jobs that are created based on config.in.slurm, you can use the following

b1>
  cd $EXEC_DIR
  make project-gpu
  sh jobs-project-gpu.sh

The results will be stored in a projects directory.

Graphing Results

To analyse the results, it is best to copy them to your local computer and use a jupyter notebook.

local>
  cd ~/github/osmi/target/rivanna
  ssh rivanna du -sh $EXEC_DIR/project-gpu
  # figure out if you have enough space for this project on the local machine
  rsync -av rivanna:$EXEC_DIR/project-gpu/ ./project-gpu/
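
The "enough space" check can also be scripted. The following is a minimal sketch using only the Python standard library; the helpers has_room and to_bytes are hypothetical, and the required size has to be taken from the du output on rivanna.

```python
import shutil


def to_bytes(gigabytes):
    """Convert a size in GiB to bytes."""
    return int(gigabytes * 1024 ** 3)


def has_room(required_bytes, path="."):
    """Return True when the filesystem holding path has enough free space."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes
```

For example, if du reports 12G for project-gpu, has_room(to_bytes(12)) tells you whether the current directory can hold a copy.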

Now we can analyse the data with

local>
  open ./analysis/analysis-simple.ipynb

Graphs are also saved in ./analysis/out

The notebook takes the results from the cloudmesh experiment executor and produces several graphs.

Compile OSMI Models in Interactive Jobs (avoid using)

Interactive jobs allow you to reserve a node on rivanna so that it looks like a login node. This interactive mode is useful only during the debug phase, where it can serve as a convenient way to debug and to experiment interactively with running the program.

Once you know how to create jobs with a proper batch script, you will likely no longer need interactive jobs. We keep this documentation for beginners who like to experiment in interactive mode while developing batch scripts.

First, obtain an interactive job with

rivanna>
  ijob -c 1 -A bii_dsc_community -p standard --time=01:00:00

To request a particular GPU, please use one of the following.

rivanna>
  export GPUS=1
  # v100
  ijob -c 1 -A bii_dsc_community --partition=bii-gpu --gres=gpu:v100:$GPUS --time=01:00:00
  # (or) a100
  ijob -c 1 -A bii_dsc_community --partition=bii-gpu --gres=gpu:a100:$GPUS --time=01:00:00

node>
  cd $PROJECT/models
  python train.py small_lstm
  python train.py medium_cnn
  python train.py large_tcnn
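
The ijob variants above differ only in the partition and the GPU request. As a minimal sketch, they can be generated from one hypothetical helper, ijob_command, which uses only the flags and values shown above and is not part of OSMI.

```python
def ijob_command(gpu=None, gpus=1, account="bii_dsc_community",
                 time="01:00:00"):
    """Build the ijob command line used above as a single string."""
    cmd = ["ijob", "-c", "1", "-A", account]
    if gpu is None:
        # CPU-only interactive job on the standard partition.
        cmd += ["-p", "standard"]
    else:
        # GPU job: gpu is the GPU type, for example "v100" or "a100".
        cmd += ["--partition=bii-gpu", f"--gres=gpu:{gpu}:{gpus}"]
    cmd.append(f"--time={time}")
    return " ".join(cmd)


print(ijob_command())
print(ijob_command("a100", gpus=1))
```

The printed lines can be pasted at the rivanna prompt; the helper only builds the string and does not submit anything.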

For this application there is no separate data

3. Running OSMI Bench on a local Windows WSL

TODO: Nate

  1. Create an isolated new wsl environment
  2. Reuse the steps from the native Ubuntu section, but keep this documentation separate, as the native Ubuntu install may have other steps or issues

Create python virtual environment on WSL Ubuntu

wsl>
  python3.10 -m venv /home/$USER/OSMI
  source /home/$USER/OSMI/bin/activate
  python -V
  pip install pip -U

Get the code

To get the code we clone the GitHub repository. Please execute:

wsl>
  export PROJECT=/home/$USER/project/
  mkdir -p $PROJECT
  cd $PROJECT
  git clone https://github.com/laszewsk/osmi #git@github.com:laszewsk/osmi.git
  cd osmi/
  pip install -r $PROJECT/mlcommons-osmi/wsl/requirements.txt
wsl>
  cd $PROJECT/mlcommons-osmi/wsl
  make image
  cd models
  time python train.py small_lstm  # 14.01s user 1.71s system 135% cpu 11.605 total
  time python train.py medium_cnn  # 109.20s user 6.84s system 407% cpu 28.481 total
  time python train.py large_tcnn
  cd .. 

References

  1. Wesley Brewer, Daniel Martinez, Mathew Boyer, Dylan Jude, Andy Wissink, Ben Parsons, Junqi Yin, Valentine Anantharaj, Production Deployment of Machine-Learned Rotorcraft Surrogate Models on HPC, 2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC), DOI: 10.1109/MLHPC54614.2021.00008, https://ieeexplore.ieee.org/document/9652868 (TODO: ask Wes for the link to the free PDF; as ORNL is a coauthor, it should be available on their site.)

  2. Using Rivanna for GPU usage, Gregor von Laszewski, J.P. Fleischer https://github.com/cybertraining-dsc/reu2022/blob/main/project/hpc/rivanna-introduction.md

  3. Setting up a Windows computer for research, Gregor von Laszewski, J.P. Fleischer https://github.com/cybertraining-dsc/reu2022/blob/main/project/windows-configuration.md

  4. Initial notes to be deleted, Nate: https://docs.google.com/document/d/1luDAAatx6ZD_9-gM5HZZLcvglLuk_OqswzAS2n_5rNA

  5. Gregor von Laszewski, J.P. Fleischer, Cloudmesh VPN, https://github.com/cloudmesh/cloudmesh-vpn

  6. Gregor von Laszewski, Cloudmesh Rivanna, https://github.com/cloudmesh/cloudmesh-rivanna

  7. Gregor von Laszewski, Cloudmesh Common, https://github.com/cloudmesh/cloudmesh-common

  8. Gregor von Laszewski, Cloudmesh Experiment Executor, https://github.com/cloudmesh/cloudmesh-ee

  9. Gregor von Laszewski, J.P. Fleischer, Geoffrey C. Fox, Juri Papay, Sam Jackson, Jeyan Thiyagalingam (2023). Templated Hybrid Reusable Computational Analytics Workflow Management with Cloudmesh, Applied to the Deep Learning MLCommons Cloudmask Application. eScience'23. https://github.com/cyberaide/paper-cloudmesh-cc-ieee-5-pages/raw/main/vonLaszewski-cloudmesh-cc.pdf, 2023.

  10. Gregor von Laszewski, J.P. Fleischer, R. Knuuti, G.C. Fox, J. Kolessar, T.S. Butler, J. Fox (2023). Opportunities for enhancing MLCommons efforts while leveraging insights from educational MLCommons earthquake benchmarks efforts. Frontiers in High Performance Computing. https://doi.org/10.3389/fhpcp.2023.1233877
