-
Notifications
You must be signed in to change notification settings - Fork 23
/
Copy pathbenchmark.doc
408 lines (262 loc) · 11.2 KB
/
benchmark.doc
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
*******************************************************************************
10/2004
vxpair benchmark on kiwi cluster
Run to t=.1
cpu time per timestep
cpu 1024x512 2048x1024 4096x2048
1 .02509 .10445 too big
2 .02276 .09262 .38686
4 .01744 .06993 .27939
*******************************************************************************
3/25/03
LAMPI (CVS 3-24-03), QSC
256^3
w/o Reliability HP-R12
1x1x8 0.625 0.628 .604
1x1x16: 0.36340
1x1x32: 0.18887
1x1x64: 0.08524
512^3
1x1x64 0.86919 0.87976
1x1x128 0.43432
1x1x256 0.207
16 32 64
CNLS cluster .759 .366 .190
PINKISH:
mpirun --gm --nper 2 lampi 2 per node
256^3
1x1x8 2.02639, 1.98
1x1x16 .766 1.03
1x1x32 .460 .475
1x1x64 .206 .199
1x1x128 .11476, .11304
512^3
1x1x128 1.24
**********************************************************************************
hi-res timings on QA, AFTER CPU upgrade:
using MPI_64bit_R5 unless otherwise noted
iso12 run, 512^3 runs with delt=.0003, t=1=1 eddy turnover time
3333 timesteps per eddy turnover time
assume 1024^3 needs 6666 timesteps per eddy turnover time
= 5 days per eddy turnover time on 512 cpus.
1536^3 needs 9999 timesteps. 2min per timestep, 14days.
2048^3 needs 2*6666 timesteps. at 2.70m per timestep
(2048 cpus, 2 rails), that's 25 days for 1 eddy turnover time.
R_lambda=750 kmax*eta=1
2*6666*4.1 = 38 days per eddy turnover time
R_lambda=1100 kmax*eta=1
4096^3 = 2432 days per eddy turnover time.
Umax = max( |u| + |v| + |w| )
CFL=1.5 = Umax delt/delx
1024^3 144Gb 24Gb per variable 1b gridpoints
1536^3 486Gb
2048^3 1152Gb 192Gb per variable
3072^3 3240Gb needs at least 1620 cpus. 6x1x512?
try 3x1x512
cpu time per timestep
(running 5 timesteps)
512^3 SHI_YI
1x1x64 .813 (with forcing) 7.6s
1x1x128 .393 (with forcing)
1x1x256 .195 (with forcing)
2x1x256 .158 (with focfing)
4x1x256 .116 (with forcing)
4x2x128
2x2x256 (doesn't work)
1024^3
1x1x128 3.76 (with forcing)
1x1x256 1.89 (with forcing)
1x1x512 1.04 (with forcing)
2x1x512 0.684
4x1x412 0.539
1536
1x1x384
1x1x512
1x1x768 2.05
2x1x768 1.50
3x1x768 1.46
2048^3
1x1x512 failed to load. try from scratch1?
1x1x1024 (with focing: 12.8min! 3x slower than old timing. try again? IN
1x1x1024 4.18, 4.18, 3.63(2x rail)
2x1x1024 3.04 with 2x rail: estmate: 2.70
4x1x1024
on QB
2048^3
1x1x1024 7.22 (1 rail)
3.45 (2 rails) a little better than 3.63.
3.61
2x1x1024
1024^3
1x1x512 1.00 (1 rail) as expected
**********************************************************************************
512^3 benchmarks
18Gb memory
134M gridpoints
cpus cpu per timestep
(averaging over 5 timesteps)
O3K 64 1.40
128 .697
256 .332 1419steps, diags & output:.5124
2x1x256 512 .394
500^3 125 .55
500^3 2x1x250 500 .209
old bluemountain numbers:
500^3 on 125 cpus: .860
512^3 on 64 cpus: 1.93
512^3 on 32 cpus: 5.0
QA:
64 .831
128 .422
256 .221
(2x1x256) 512 .169
(4x1x128) 1024 .954 ouch!
1536 benchmarks
03K will not load.
**********************************************************************************
hi-res timings on QSC and QA, before CPU upgrade:
iso12 run, 512^3 runs with delt=.0003, t=1=1 eddy turnover time
3333 timesteps per eddy turnover time
assume 1024^3 needs 6666 timesteps per eddy turnover time
= 5 days per eddy turnover time on 512 cpus.
2048^3 needs 2*6666 timesteps = 40 days per eddy turnover time
on 1024 cpus.
2048^3 1152Gb 192Gb per variable
3072^3 3240Gb needs at least 1620 cpus. 6x1x512?
try 3x1x512
1024^3 timeings on ASCI Q
5 timesteps
dns: 1x1x512 6.20min
dnsghost:
1x1x512 3.17min
4x1x256 3.08
REDO:
2048^3
dns: 1x1x512 45.84 (5 timesteps) 9.17m per timestep
1x1x1024 5 timesteps 4.35m per timestep
2x1x1024
4x1x1024
dnsghost 8x8x8 6.50 (1 timestep)
dnsghost 2x2x128 insufficient virtual memory
dnsghost 8x4x16 same
dnsghost 4x4x32 same
1x1x1024 size per cpu: (2052*2052*6)*8*3 *6 = 3.4Gb per cpu.
2x1x512 wont load 2.2Gb per cpu
dhsghost 4x1x256 5 timesteps: 2.09m per timestep 1.7Gb per cpu
dhsghost 2x2x256 5 timesteps: 2.28m per timestep
1536^3
dnsghost 1x1x512 wont load
dnsghost 8x8x8 2.70m 1 timestep
dnsghost 2x1x256 1.73m
dnsghost 2 2 128 2.00m
dnsghost 4 2 64 2.16
dnsghost 4 4 32 2.27
dnsghost 8 4 16 2.32
(4 + 1536/n1)(4 + 1536/n2)(4 + 1536/n3)
**********************************************************************************
hi-res timings on QSC, before CPU upgrade:
t=-1 (one timestep)
2048^3 11.05m 32K timesteps: 245 days.
t=-5 (5 timesteps)
2048^3 46.83m = 9.37m per timestep.
t=-43 (.05) Nirv BlueMtn 10eddy
one box timings: 256 on 32 cpus > 15min
256 on 64 cpus 9.11m 8.32
256 on 128 cpus 7.65m 4.11m
250 on 125 cpus 7.65m 3.98m
400 on 5x1x25 27.03m
400 on 1x100 32.09m
t=-5 (5 time step) Nirv Bmtn Q 8K timesteps
512^3 on 32 cpus: 12.43 13.8d
512^3 on 64 cpus: 5.07 5.6
512^3 on 128: 2.54 2.8
512^3 on 256: 1.22 1.4
500^3 on 500: 0.94 1.0
500^3 on 125 cpus: 4.30 4.7d
512^3 on 64 cpus: 9.66 10.7
512^3 on 32 cpus: 25.0 27
16K timesteps
1000^3 on 500 cpus: 5.64 13d
1024^3 on 256 cpus: ?
t=-20 (20 time steps) Nirv Bmtn Q 4K timesteps
256^3 on 32 cpus: 5.72m 19h
256^3 on 16 cpus: 10.52m
256^3 on 8 cpus: 18.97m
256^3 on 128 cpus: 6.89 3.97(how many timesteps???)
check bluemountain output files
**********************************************************************************
LOW RES benchmarks, 6/2002
cvs tag: 'bench602'
./bench.sh 96 0 20
121M
20 timesteps
1cpu 2cpu 3cpu 4cpu 6 8
milkyway .16 .080 .054
sulaco/lam .21 .17 .17
sulaco/lampi .22 .17 .60
shankara.old (2x2) .28 .19 .093 .064 .050
shankara 4ring .27 .14 .059 .062 .047
shankara-LAM(ethernet).27 .200
sixhitch .29
pinkish .25 .050
pinkish-lampi .037
-xW vecorization. cant use on Athlon, doesn't help that much:
sandboxx with -xW .202
sandboxx w/o -xW .207
./bench.sh 160 0 20
544M
milkyway 1.22 .42
autrey 1.40
shankara.old 1.76 .62 .29
running 4 cpus, 1 cpu per node: .51
./bench.sh 192 2 20
980M
./bench.sh 240 4 20
1.9G
1cpu 2cpu 3cpu 4cpu 6 8 16 24
shankara (16 cpus = 2x1x8) 4.53 1.27 .737 .347
./bench.sh 256 8 20 (can run on 16 and 24 cpus?)
2.3G 1cpu 2cpu 8 16 24 32 64
milkyway 6.91 3.33
shankara.old 1.61
shankara 4ring 1.53
shankara-LAM 3.90
shankara ? .750 ?
CNLS cluster .759 .366 .190
QSC-lampi: .625
pinkish 2.03 .471 .113
./bench.sh 288 8 20
3.2G
./bench.sh 512 64 40
18.4 cpus: 32 64
CNLS cluster 2.13
**********************************************************************************
**********************************************************************************
LOW RES benchmarks. around 11/2001?
Spectral Model 1 CPU timings.
1 CPU 96^3 120M
*****************************************************************
./bench.sh 96 0 5
run 5 timesteps. CFL=1.5/.25, no structure functions
darkstar (linux 1Ghz)
PGF90, FFT99, auto 2.90 2.90
PGF90, FFT99, params 2.86 2.87
PGF90,FFT99, auto 1.16 1.15
IFC,FFT99, auto 1.07 1.07
IFC,FFT99, params 1.05 1.07
LF90,FFT99, auto 1.28 1.27
autrey (linux 1.? Ghz )
PGF90,STK2+PGCC, auto 1.41 1.40
PGF90,STK2+GCC, auto 1.53 1.53
IFC,STK2+GCC, auto 1.45 1.45
LF90,STK2+GCC, auto 2.01 2.01
Nirvana SGIFFT 2.92 2.92 2.93 2.94
Nirvana FFT99 2.97 2.97 2.97
Nirvana STK
Truchas (Compaq ES40)
FFT99,params 1.03 1.03
CPQ,params 1.00 1.00
STK2,auto 1.07 1.07
FFT99,auto 1.03 1.03
CPQ,auto 0.99 0.99
*****************************************************************