This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

OpenPAI Backlog #4512

Open
5 tasks
scarlett2018 opened this issue May 11, 2020 · 2 comments

Comments

@scarlett2018
Member

scarlett2018 commented May 11, 2020

This issue is a long-term backlog for planning and discussion. Please feel free to add whatever is top of mind for you directly to this issue.

Last Updated on 02/01/2021

Top Focus Scenario

  • Support HiveD Management experience
  • AI Optimized Workload E2E support
    • use case study for internal & external typical workloads

Multi-Cloud/Multi-Cluster

  • Hybrid installation
  • multi/cross cluster management
  • multi-cluster job scheduling

Job management

  • Favorite Jobs

AutoScaler

  • provisioning of newly added nodes

Engineering Excellence

  • CI/CD
  • Build process

More examples to supplement the current ones

Add user team support

  • Jupyter Notebook support

  • FC detects preemption, GC, eviction, etc. @yqwang-ms

  • FC rescale Framework @yqwang-ms

  • Update the official Docker image (compare with deepo) @hzy46

Installation experience

@fanyangCS
Contributor

#3872

@scarlett2018 scarlett2018 pinned this issue May 25, 2020
@scarlett2018 scarlett2018 self-assigned this May 28, 2020
@scarlett2018 scarlett2018 changed the title from "Post 0.18 brainstorming and planning" to "OpenPAI Backlog" Jun 22, 2020
@Binyang2014 Binyang2014 unpinned this issue Jul 9, 2020
@fanyangCS fanyangCS pinned this issue Jul 12, 2020
@scarlett2018
Member Author

Backup of historical backlog info, for reference only.

Brainstorming on 2020/09/10

[Planned in #4898] Cell as SKU in HiveD scheduler.

  • Refactor duplicate deployment code.
  • Refactor job submission form.

[Planned in #4898] Support dynamic sku types in different vc.

@hzy46:
[Planned in #4898] pod/event watcher to support #4649
[Planned in #4898] multi-cluster management detailed design & review

@suiguoxin: #4789
[Planned in #4898] Alert-manager: kill low-GPU-utilization jobs, tag abnormal jobs, cordon nodes with the k8s API on GPU ECC errors
[Planned in #4898] Job tags: DB table, rest-server API, web-portal refactor

@yiyione:
[Planned in #4898] Group management page in webportal
[Planned in #4898] VC request management for user and admin

  • VSCode client bug fix

@debuggy:

  • marketplace-related items (2020 Sept Release Plan, openpaimarketplace#60)
    • [Planned in #4898] Grammar check
    • NNI job integration -- Feature Engineering
    • add info for items in submit job page
    • add "Upload" feature for dataset, set upload standard and workflow
  • job submit refactor
    • new submit workflow design
    • team storage
    • sku
  • job detail work part 2
    • job retry more detail
    • show events in job detail page

Per Task Retry History

  • FC generates Task history snapshots (mocked Task CRD object with UID) @yqwang-ms
  • Fluentd + DB to collect above snapshots @hzy46
  • WebPortal exposes the Per Task Retry History (and failure counts grouped by failure type within one job?) @debuggy

Brainstorming on 5/14 (last updated on 5/28)

Multi-Cloud/Multi-Cluster @hzy46

  • Hybrid installation
  • multi/cross cluster management
  • multi-cluster job scheduling

AutoScaler @ydye

  • provisioning of newly added nodes

Support for large scale cluster

  • (June) set up latency benchmark report for 100 virtual nodes --> 1500 virtual nodes * 8 job @ydye
  • based on the stress and performance test result, analyze and plan for improvements

GPU Utilization

More examples to supplement the current ones

Add user team support

Surface more backend errors to users in the Job Details page (#4649)
@qfyin, @yqwang-ms
Potential failure case: storage mount failed

Installation experience

These are incomplete items from v0.17


Engineering Excellence

  • CI/CD
  • Build process

Low Priority Postponed items

  • (Postponed from the Aug 2020 release) Elastic DL and job-level scale-up and scale-down. @yqwang-ms to check with Ming/NNI and other integration scenarios. TODO: summarize the customer-facing scenarios to support.


2 participants