- From: TMS <tms AT charm.cs.illinois.edu>
- To: ppl-accel AT cs.illinois.edu
- Subject: [ppl-accel] TMS - New Log in Task Meeting Minutes
- Date: Fri, 28 Jun 2019 17:02:33 -0500
Michael Robson has added a new log to Task Meeting Minutes
Log Entry:
Accel Meeting
11 - 12:30
In Attendance: Nikolai, Juan, Ronak, Michael, Jaemin
Agenda
GPU Overview
Technical Details
Nikolai's relevant work
- GPU centric comm semantics
- Similar to NCCL - inspiration
- also open source on github
- well engineered (code less so)
- on nvidia's github
- Comm as a cuda kernel
- enqueue on stream - dependencies
- streams = cpu threads (logical equivalent)
- want same MPI semantics
- CUDA aware MPI
- unaware of user's computation on GPU
- have to do sync manually
- lack of comm/comp overlap - BSP model
- launch overhead - 10us
- latency on IB - 1 us
- implementation
- cpu bg thread
- cpu follow progress via cuda events
- gpu wait - 2 ways (see the sketches after this section)
- launch spin wait kernel
- 1 thread
- mem address - unified memory - poll on int, exit on nonzero
- NCCL uses this - more portable
- cuda driver call - stream memory operations section
- needs CUDA kernel module flag / driver support
- mem address w/ producer and consumer
- bit 1/0, counter, etc
- cuStreamWaitValue32 - basically the opposite of events
- not just launching kernel
- managed by SM scheduler or cuda kernel driver on CPU side
- cuStreamWriteValue32 - can implement events using this
- approach - Aluminum, on GitHub under Nikolai at Livermore
- used by some apps at LLNL
- wrapper around NCCL
- MPI collectives + send/recv + send + recv
- integration - associate a stream with an MPI communicator - one user stream
- equivalent to setting an attribute on the communicator
- communicator associated with stream
- semantics basically same on CPU
- could do this on CPU for threads/chares - notions of stream of exec
- don't support multiple streams with one comm
- could do this if careful
- assumption
- 1 rank per gpu
- comm thread impl
- c++ posix thread
- bound appropriately - hwloc
- need to initiate MPI operations in the right order
- similar to broadcasts/reductions in charm
- runtime could get around this
- tags (if they were supported - boo MPI) would get around this
- mostly proof of concept
- using GPUDirect RDMA would improve perf and remove issues
- have to be writing IB verbs code to use it
- could use MVAPICH GDR
- lots of bugs in MPI distros with multithreaded GPU usage
- recap
- comm another cuda kernel
- runtime does 'magic' to make it work like non-blocking MPI
- up to the runtime implementer to make the magic happen efficiently (see the sketches below)
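To make this concrete, here is a minimal sketch of the mechanism described above: a communicator with an associated CUDA stream, a background CPU thread that issues the MPI call, and a spin-wait kernel that blocks the stream until the communication finishes. This is not Aluminum's actual API; names like StreamComm and stream_allreduce are made up, a CUDA-aware MPI is assumed, and a real runtime would use one persistent progress thread with an ordered queue rather than a detached thread per operation.

#include <cuda_runtime.h>
#include <mpi.h>
#include <thread>

// GPU side: one thread spins on a host-mapped flag until the CPU
// communication thread sets it (the NCCL-style wait described above).
__global__ void spin_wait(volatile int *flag) {
  while (*flag == 0) { /* poll */ }
}

struct StreamComm {        // hypothetical: communicator + its associated stream
  MPI_Comm comm;
  cudaStream_t stream;
};

// Enqueue an in-place float-sum allreduce "on the stream": kernels launched on
// the same stream afterwards will not run until the MPI call has completed.
void stream_allreduce(float *devbuf, int count, StreamComm &sc) {
  // Pinned, mapped flag that the GPU can poll and the CPU can set.
  int *flag = nullptr, *d_flag = nullptr;
  cudaHostAlloc(&flag, sizeof(int), cudaHostAllocMapped);
  *flag = 0;
  cudaHostGetDevicePointer((void **)&d_flag, flag, 0);

  // 1. Mark the point in the stream where the data is ready.
  cudaEvent_t ready;
  cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);
  cudaEventRecord(ready, sc.stream);

  // 2. Block the stream until the communication thread raises the flag.
  spin_wait<<<1, 1, 0, sc.stream>>>(d_flag);

  // 3. Background CPU thread: wait for the producer kernels, run MPI, raise
  //    the flag. (A real runtime uses one progress thread and an ordered
  //    queue so MPI operations are initiated in the right order; event/flag
  //    cleanup is omitted in this sketch.)
  MPI_Comm comm = sc.comm;
  std::thread([=]() {
    cudaEventSynchronize(ready);                // data on the stream is ready
    MPI_Allreduce(MPI_IN_PLACE, devbuf, count,  // CUDA-aware MPI assumed
                  MPI_FLOAT, MPI_SUM, comm);
    *flag = 1;                                  // release the spin-wait kernel
  }).detach();
}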
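And a sketch of the driver-API alternative mentioned above (CUDA stream memory operations), which avoids burning an SM on a spin-wait kernel. cuStreamWaitValue32 and cuStreamWriteValue32 are the real driver entry points; the surrounding usage is illustrative, and on older drivers these operations needed kernel-module/driver support to be enabled.

#include <cuda.h>

// Stream stalls at this point until the 32-bit value at flag_dptr is >= 1.
// flag_dptr must be a device pointer the driver can access, e.g. host memory
// registered with cuMemHostRegister and queried via cuMemHostGetDevicePointer.
void stream_wait_on_flag(CUstream stream, CUdeviceptr flag_dptr) {
  cuStreamWaitValue32(stream, flag_dptr, 1, CU_STREAM_WAIT_VALUE_GEQ);
}

// The "opposite of events" direction from the notes: the stream writes the
// value when execution reaches this point, so a consumer waiting or polling
// on the same address observes completion (event-like signaling).
void stream_signal_flag(CUstream stream, CUdeviceptr flag_dptr) {
  cuStreamWriteValue32(stream, flag_dptr, 1, CU_STREAM_WRITE_VALUE_DEFAULT);
}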
What Marc/Pavan want moving forward
- GPU oriented comm work
- can even do from charm side
- definitely want it from the MPI side
- want to put semantics in MPI
- endpoints MPI proposal
- openmp + GPU + MPI
- probably won't happen
Other projects
- pitched to MVAPICH - want a paper
- MPI forum - need proof of concept in a (major) MPI distribution
- some nvidia (affiliated) projects
- NCCL - p2p on roadmap for the future - want to expand to SciCo
- historically - pushing IB verbs to GPU
- NVSHMEM - openshmem semantics on GPU - public version is intranode (private version available for internode) - sketch after this section
- puts/gets inside GPU kernels
- cuda scheduler can deschedule threads as with pending device mem reads
- natural fine grained comm/comp overlap
- TensorFlow - does this internally (maybe some other runtimes)
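For reference, a minimal sketch of the NVSHMEM model mentioned above: standard OpenSHMEM-style calls, with the put issued from inside a GPU kernel. The device selection and launch setup are illustrative assumptions (one PE per GPU); a real multi-node run also needs the appropriate launcher/bootstrap.

#include <nvshmem.h>
#include <nvshmemx.h>
#include <cuda_runtime.h>

// Device-initiated communication: each PE writes its rank into the symmetric
// buffer on the next PE from inside the kernel, then waits for delivery.
__global__ void put_to_neighbor(int *sym_dest, int mype, int npes) {
  int peer = (mype + 1) % npes;
  nvshmem_int_p(sym_dest, mype, peer);  // one-sided put to the neighbor PE
  nvshmem_quiet();                      // ensure the put has been delivered
}

int main() {
  nvshmem_init();
  int mype = nvshmem_my_pe();
  int npes = nvshmem_n_pes();

  int ndev = 1;
  cudaGetDeviceCount(&ndev);
  cudaSetDevice(mype % ndev);           // one PE per GPU (assumption)

  int *sym_dest = (int *) nvshmem_malloc(sizeof(int));  // symmetric heap
  put_to_neighbor<<<1, 1>>>(sym_dest, mype, npes);
  cudaDeviceSynchronize();
  nvshmem_barrier_all();

  nvshmem_free(sym_dest);
  nvshmem_finalize();
  return 0;
}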
Charm plans
- GPU aware Charm
- local completion aware sends
Next Steps
- getting up to speed on GPUs - nvidia programming guide (Juan)
- optional: PTX ISA
Stray
- Marc also retiring today
- 32 GB V100
- where does the nvidia runtime run?
- some on device
- a lot on cpu
Stray 2
- Pavan on committee?
- charm messages as GPU kernels? - extend Nikolai's work