charm AT lists.siebelschool.illinois.edu
Subject: Charm++ parallel programming system
[charm] CharmLU optimization for heterogeneous run (Broadwell+Intel Xeon Phi 7250)
- From: Ekaterina Tutlyaeva <xgl AT rsc-tech.ru>
- To: charm AT cs.uiuc.edu
- Subject: [charm] CharmLU optimization for heterogeneous run (Broadwell+Intel Xeon Phi 7250)
- Date: Fri, 20 Jan 2017 12:31:50 +0300
Dear support,
I'm trying to get the best results for CharmLU in a heterogeneous environment: two nodes with different CPUs.
First node: Broadwell
2 CPU per node
Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz;
20 cores, 40 threads (http://ark.intel.com/ru/products/91753/Intel-Xeon-Processor-E5-2698-v4-50M-Cache-2_20-GHz)
(theoretical peak is about 665.6 GFlops double precision);
RAM: 128 GB, DDR4/2133MHz
Second node: Knights Landing
1 CPU: Intel Xeon Phi 7250 @ 1.40 GHz
68 cores, 272 threads
(theoretical peak 3046.4 GFlops double precision)
RAM: 16 GB Intel MCDRAM + 192 GB DDR4
So far the best result I have been able to get in this combined environment is 661.186 GFlops (while the theoretical peak is about 3700!).
Would you be so kind as to give me some hints on how I can optimize my runs and get a little closer to the theoretical peak with CharmLU in my environment? Maybe there are some scheduler features or optimizations I am missing?
What am I doing wrong? What can I optimize?
Maybe there is a more appropriate scheduler strategy that I could choose? (I've tried the +MetaLB scheduler.) Should I manually set the process mapping scheme?
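(By "manually" I mean something like the pinning flags below. This is only my guess: the ++ppn/+pemap/+commap values are assumed for the Broadwell node's layout, I have not verified them against the actual topology, and I don't know how to express a different per-node mapping for the KNL node.)

```shell
# Guessed explicit pinning for one SMP process per node:
# 39 worker threads on cores 0-38, communication thread on core 39.
# (Core numbers are my assumption for the Broadwell nodes only.)
CMD="./charmrun --bootstrap ssh -machinefile hosts +p2 ++ppn 39 \
./charmlu 144000 360 1200000000 120 3 +pemap 0-38 +commap 39"
echo "$CMD"
```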
The execution parameters for the best run (661 GFlops):
./charmrun --bootstrap ssh -machinefile hosts +p108 ./charmlu 144000 360 1200000000 120 3
I've tried different block sizes (from 120 up to 560)
and different numbers of processes (min = 88, max = 172, while the total number of physical cores is 108 = 68 (KNL) + 2*20 (2x Broadwell)).
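For reference, the sweep itself was just a nested loop over these parameters. A simplified dry-run sketch (the exact ranges here are from memory; drop the leading "echo" to actually launch the runs):

```shell
# Dry-run sketch of the parameter sweep (ranges are my own choices).
N=144000            # matrix size (largest that does not segfault here)
MEM=1200000000      # memory threshold that gave the best numbers
MAP=3               # mapping scheme 3 = 2D tiling
PIVOT=120           # pivot batch size
COUNT=0
for BLOCK in 120 240 360 480 560; do
  for P in 88 108 136 172; do
    echo ./charmrun --bootstrap ssh -machinefile hosts +p"$P" \
         ./charmlu "$N" "$BLOCK" "$MEM" "$PIVOT" "$MAP"
    COUNT=$((COUNT + 1))
  done
done
```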
144000 is the maximum matrix size I can use; larger values crash with a segfault.
1500000000 is the maximum memory threshold I can use; larger values finish with a segfault** (while a memory threshold of 1200000000 gives the best benchmark values).
The mapping scheme 3 (2D tiling) and the pivot batch size of 120 were found empirically (on smaller matrix sizes these values give the best results). Maybe you could recommend the best ranges for these values to benchmark in my environment?
The Send Limit: 2 parameter stays unchanged; maybe I should do something with it?
My CharmLU compilation options:
OPT = -O3 -axCORE-AVX2,MIC-AVX512
config.mk uses MKL for math:
SEND_LIM = 2
BLAS_INC = -I${MKLROOT}/include
BLAS_LD = -L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64
BLAS_LIBS = -lpthread -lm -ldl
I'm using charm-6.7.1.
The build options for charm:
./build LIBS mpi-linux-x86_64 smp -j14 --with-refnum-type=int -axCORE-AVX2,MIC-AVX512
Thank you very much for your time!! Sorry for the long letter...
Best regards,
Ekaterina
ps:
** [1] Stack Traceback: (for Mem Threshold (MB): 1600000000)
[1:0] _ZN14BlockScheduler13registerBlockE9CkIndex2D+0x1f2 [0x5490e2]
[1:1] _ZN5LUBlk4initE8LUConfig12CProxy_LUMgr21CProxy_BlockScheduler10CkCallbackS3_S3_+0x133 [0x539ea3]
[1:2] _ZN5LUBlk8_when_20EPN13Closure_LUBlk18startup_25_closureE+0x1f4 [0x4d8944]
[1:3] _ZN13CkIndex_LUBlk35_call_schedulerReady_CkReductionMsgEPvS0_+0x259 [0x53c8e9]
[1:4] CkDeliverMessageReadonly+0x118 [0x5a84e8]
[1:5] [0x614c81]
[1:6] _ZN15CkIndex_CkArray29_call_recvBroadcast_CkMessageEPvS0_+0x410 [0x680800]
[1:7] CkDeliverMessageFree+0x8a [0x5ca18a]
[1:8] [0x5b29f8]
[1:9] CsdScheduler+0x59f [0x80249f]
[1:10] [0x7e8db7]
[1:11] [0x7e638a]
[1:12] +0x7dc5 [0x2b5204c08dc5]