runs.log
Tags: (see below for more details)
NSF1: first Tag of code for NSF petascale RFP:
NSF2: Tag of debugged dnsp code
NSF3: Tag dnsp + instrumented FFT and A2A overlap
GPL1: Tag of code before GPL branch (7/22/2008)
==========================================================================
2/13/2009
ANL Intrepid dns x-pencil vs. p3dfft timings.
dnsp original x-pencil code
dnsp2 x-pencil code with optimization to remove y-transposes
dnsp3: code uses P3DFFT
dnsp per timestep
512^3 (8x1x256) 0.02036 min
dnsp2 per timestep
512^3 (8x1x256) 0.02020 min
2048^3 (32x1x1024) 0.20168
2048^3 (128x1x256) 0.18770
dnsp3
512^3 (1x8x256) 0.01794 min
2048^3 (1x32x1024) 0.12643
2048^3 (1x64x512) 0.12675
2048^3 (1x128x256)
TIMING JUST THE FORWARD TRANSFORM
Mark Straka's timings:
out.4x1024.2048_NEW_ESSL.output: proc_id, cpu time per loop 0 0.98941431058823536
out.4x1024.2048_NEW_FFTW.output: proc_id, cpu time per loop 0 1.34859306352939257
My best ESSL time translates (dividing by 8, assuming perfect scaling :)) to a
time for 32k procs of about 0.125s. (0.1686 for FFTW)
My timings:
2048^3 1K nodes 8K cores
p3dfft 4-1024 2.895 forward & back
p3dfft 4-1024 1.249 forward only
2048^3 8K nodes 32K cores
p3dfft 128-256 0.5517 0.5497 (ERROR! timing *both* directions)
p3dfft 64-512 0.3704 0.3705 (ERROR! timing *both* directions)
p3dfft 32-1024 0.1855 0.1860 (fixed, timing only forward transform)
dnsp2 128-1-256 0.3956 0.4093
dnsp2 64-1-512 hung
dnsp2 32-1-1024 0.6754 0.6723
why does that one hang? scaled down version:
on each proc: 32x2048x4
64^3
dnsp2: 2x1x16
*******************************************************************************
7/22/2008
NSF1: first Tag of code for NSF petascale RFP:
NSF2: Tag of debugged dnsp code
NSF3: Tag dnsp + instrumented FFT and A2A overlap
GPL1: Tag of code before GPL branch (7/22/2008)
GPL1_branch: branch at GPL1
GPL1_branch_tag1 tag after adding all GPL stuff
GPL1_branch_tag2 a few more edits, including #define TRUNC in params.F90
passes these tests:
../testing/test3d_forcing.sh r
../testing/test.sh 1r
On the trunk, I then tried:
cvs update -j GPL1 -j GPL1_branch_tag1 src
BIG MISTAKE - this deletes the analysis*.F90 routines that were deleted in GPL1_branch,
so I gave up and added the GPL headers to the trunk by hand.
*******************************************************************************
9/6/2007
balu's PDF idea
running balu2.inp at 1024^2 on 64 blackrose nodes (256 cores)
cost to run time=10: 12.68min
balu_a
*******************************************************************************
8/25/2007
testing boussinesq vs dns:
dnsb with theta: 10 timesteps: .32 23% more expensive
dns with 0 scalars, 10 timesteps: .26
64^3 problem, 2 cores on dosadi F7
*******************************************************************************
12/02/2006
sc2048decay
restart from sc2048A below, at time t=1.0, but with no forcing.
added u3 to diagnostics. See how this decays for Jens Lorenz.
decay2048 does not have u3 or u4:
they are in turb-diag now, but they were not present at the time
decay2048 was run. Also, u3 does not have the absolute value.
*******************************************************************************
11/23
Benchmarks on RS/dualcore with FFTW
4096^3 done
3072^3 done
2048^3 done
1024^3 done
512^3 done
Redoing with A2A overlap:
4096^3 done
3072^3 done
Bandwidth numbers:
Running Ramanan's transpose code (MPI_alltoall in subcommunicators)
Min. over all runs, of the max time
tr max tr+copy
4096x4 .00636 .2886 (???)
4x4096 .00185 .00409
128x128 .00444 .0136
DNS code. 4096^3, using a 4x1x4096 decomposition:
transpose_to/from_x: .00725 (this is like the 4x4096 case above)
transpose_to/from_z: .0123 (this is like the 4096x4 case above)
*******************************************************************************
11/14
NOTE: 11/23: FFTW routine is over 2X faster than FFT99,
so I stopped with these benchmarks and I'm rerunning the above
dns code:
RS size -d output:
4096: 804307056 * 18384 = 12272 GB 512GB per array
511175888*8*4096 = 15599GB
3072: 1417862352*3072 = 4056 GB (19 arrays)
2048: 637377744*2048 = 1215 GB (19 arrays)
at high CPU counts, it looks like there is some storage growing
faster than linear?
RS Jumbo runs
dnsp = 20 arrays = 20*8*N^3 bytes = 1.49e-7 N^3 GB (see the sketch after this list)
4096^3 10240 GB min VN mode: 16384
16384 allocate() fails.
try with static allocation IN-Q
3072^3 4320 GB VN mode: 6144
CO mode
3072nodes
VN mode
18432 done
9216cores done
6144cores not enough memory. should require .7gb per core.
6144cores ok w/o MPICH_UNEX_BUFFER_SIZE 180M)
2048^3 1300 GB min VN mode: 2048 cpus.
VN mode
16384 done
8192 done
4096 done
2048 ?
1024^3 160GB min VN mode: 256 cpus
16384 ?
8192 ?
4096->256 done
512^3 ? redo this data?
16384
8192
4096
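Quick check of the memory estimate above (dnsp = 20 double-precision arrays of 8*N^3 bytes each). Throwaway Python sketch, not part of the dns code; the helper name is made up.
# hypothetical helper, just evaluates the 20*8*N^3 estimate from this log
def dnsp_memory_gb(n, arrays=20):
    return arrays * 8.0 * n**3 / 2**30   # bytes -> GB (binary)
for n in (1024, 2048, 3072, 4096):
    print(n, dnsp_memory_gb(n))   # 160, 1280, 4320, 10240 GB, close to the sizes quoted above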
*******************************************************************************
8/23
FLOP count runs
N^3 = 16,24,32,48,64
time = -1,-2,-3
dnsp
forcing12-bench.inp (but set to 1,2 or 3 timesteps)
Couldn't get opcontrol to work. (lapic not used?)
see if FLOPS is possible:
0. sudo opcontrol --list-events
1. sudo opcontrol --no-vmlinux
2. sudo opcontrol --start
--event=FLOPS
3. ./dnsp -i temp.inp
4. opreport -l ./dnsp
2. sudo opcontrol --shutdown
EVENT = FLOPS
On Cobalt: (NCSA)
qsub -I -V -l walltime=00:30:00,ncpus=8,mem=8gb
pfmon -efp_ops_retired dnsp -i temp.inp
%module load histx+.1.2a
%lipfpm -h
%lipfpm -e CPU_CYCLES -e FP_OPS_RETIRED ./a.out
resolution: 16^3 flops FLOP per timestep:
-6 80161720 diff: 8949342
-5 71212378 diff: 8949549
-4 62262829 diff: 8949448
time=-3(4 timesteps): 53313381 diff: 8949364
time=-2(3 timesteps): 44364017 diff: 9230453
time=-1(2 timesteps): 35133564
resolution: 32^3 flops diff
-6 696585740 78413858
-5 618171882 78414194
-4 539757688 78413964
-3 461343724
-3 NCPU=2 65139
1834353
resolution: 64^3 flops diff
-6 6189080299 702394764
-5 5486685535 702395163
-4 4784290372 702395166
-3 4081895206
resolution: 128^3 flops diff
-4 41068986552 6062569260
-3 35006417292
resolution: 256^3 flops diff
-4 358258523792 53102820392 26min
-3 305155703400
resolution: 12^3 flops diff
-4 26885228 3873213
-3 23012015
resolution: 24^3 flops diff
-6 300568043 33806552
-5 266761491 33807033
-4 232954458 33806951
-3 199147507
resolution: 48^3 flops diff
-6 2662196785 302423113
-5 2359773672 302423661
-4 2057350011 302423679
-3 1754926332
resolution: 96^3 flops diff
-4 17571976245 2595103294
-3 14976872951
resolution: 192^3 flops diff
-4 153149117982 22711603057
-3 130437514925 (5min)
Fit to: c1 4 N^3 + c2 4*27 N^3 log2(N): (because we do 27 FFT's per timestep; no constant term in the fit)
clear; N=[16,32,64,128,256];
for i=1:length(N)
A(i,:) = [ 4*N(i)^3 , 4*27*log2(N(i))*N(i)^3 ]/N(i)^3;
end
b = [8949342 ; 78413858 ; 702394764; 6062569260; 53102820392]' ./ (N.^3);
c=A\b'; % least-squares solve for c = [c1;c2]
semilogx(N,b,'o'); hold on;
x=12:256; f=(c(1)*4*x.^3+c(2)*4*27*log2(x).*x.^3)./(x.^3);
plot(x,f); hold off;
disp(sprintf('%10.1f N^3 + %5.3f 27 N^3 log2(N) ',c(1),c(2)));
(A*c - b')./b'
# answer: 4( 297 N^3 + 2.28 27 N^3 log2(N) )
# error: less than 1%
clear; N=[12, 24,48,96,192];
b = [ 3873213; 33806552 ; 302423113; 2595103294;22711603057 ]'./(N.^3);
for i=1:length(N)
A(i,:) = [ 4*N(i)^3 , 4*27*N(i)^3*log2(N(i)) ]/N(i)^3;
end
c=A\b'; % least-squares solve for c = [c1;c2]
loglog(N,b,'o'); hold on;
x=12:256; f=(c(1)*4*x.^3+c(2)*4*27*log2(x).*x.^3)./(x.^3);
plot(x,f); hold off;
disp(sprintf('%10.1f N^3 + %5.3f 27 N^3 log2(N) ',c(1),c(2)));
(A*c - b')./b' % residual
# answer: 4( 340 N^3 + 2.24 27 N^3 log2(N) )
# error: less than 1%
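Same fit redone in Python/NumPy as a cross-check of the first series above (N=16..256); a scratch sketch, not part of the analysis scripts, and it should land near the 297 / 2.28 quoted above.
import numpy as np
# FLOPs per timestep (the "diff" column) for N = 16..256 from this log
N = np.array([16., 32., 64., 128., 256.])
F = np.array([8949342., 78413858., 702394764., 6062569260., 53102820392.])
# model: F/N^3 = 4*c1 + 4*27*log2(N)*c2, least squares over the 5 sizes
A = np.column_stack([4*np.ones_like(N), 4*27*np.log2(N)])
b = F / N**3
c1, c2 = np.linalg.lstsq(A, b, rcond=None)[0]
print(c1, c2)   # expect roughly 297 and 2.28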
*******************************************************************************
8/23
2048^3: 5 days on 4096 cores.
3072^3: 25 days on 4096 cores.
Redstorm run:
sc2048A: (see sc2048A.job, sc2048A.inp)
This run should take 13K timesteps. on 4096cores, .5min per = 120h
2048 run modeled after sc1024A:
sc2048A data:
ke = 1.95 eps=4.00 mu=1.35e-5 kmax=965*2*pi
kmax*eta = .959 eta=.000158 ett = .976
eyeball averages over last 25% of run:
ke = 1.90 eps=3.8 mu=1.35e-5 kmax=965*2*pi
(and this agrees better with sc1024A data) R_lambda = 700
sc1024A data:
ke=1.886 eps=3.58 mu = .35e-4
eta*kmax_spherical = eta * nx*pi*2*sqrt(2)/3 = 1.0
sc3072A modeled after sc2048A data:
ASSUME eps=4.00 kmax=1448*2*pi
nu = eps^(1/3) k_max^(-4/3) = .84e-5
IN, but always gets ec_node failure errors.
dsacp rs:/scratch2/mataylo/sc2048A hpss:dns
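Quick check of the sc3072A viscosity estimate above (scratch Python, values copied from this entry):
from math import pi
eps, kmax = 4.0, 1448*2*pi                 # assumed eps and kmax for sc3072A, as above
print(eps**(1.0/3.0) * kmax**(-4.0/3.0))   # ~0.84e-5, matching the nu quoted above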
*******************************************************************************
7/27
NSF1: first Tag of code for NSF petascale RFP:
NSF2: Tag of debugged dnsp code
NSF3: Tag dnsp + instrumented FFT and A2A overlap
*******************************************************************************
7/20/2006
N=12288
Rl = 2000
delt=.0001 ett
vorticity, velocity, pressure saved every .02 ett
my delt: sqrt(ke) delt/delx = .17 (see below)
ett = 2*ke/eps
lambda=sqrt( mu*(2*Euse/3) ./ (epsilon/15) );
Rl = lambda*sqrt( 2 KE/3) / mu
Rl = KE sqrt(20/3 / (mu*epsilon))
eps = mu * grad(u)^2
eta = (mu^3/epsilon)^.25 = delta_x/3 mu ~= N^(-4/3)
2x = .40
12x = .036
Resolution Requirements N^3, 2/3 dealiasing:
ke=1.886 eps=3.58
eta*kmax = eta * 2*pi*(N/3)
eta*kmax = (mu^3/epsilon)^.25 * N*pi*(2/3)
mu^3 = eps*[(eta*kmax)/(pi*2N/3)]**4
mu = eps**(1/3) * [(eta*kmax)/(2*pi*N/3)]**(4/3)
Rl = ke sqrt(20/3/(mu*eps))
12288 run modeled after sc1024A:
ke=1.886 eps=3.58 mu = 1.25e-6
delt = 1e-5 sqrt(ke) delt/delx = .17
Rl = 2290
eta*kmax_spherical = eta * nx*pi*2*sqrt(2)/3 = 1.0
eta*kmax_2/3 = eta * nx*pi*(2/3) = .70
12288 with Rl=2000
ke=1.886 eps=3.58 mu = 1.6e-6
delt = 1e-5 sqrt(ke) delt/delx = .17
Rl = 2034
eta*kmax_spherical = eta * nx*pi*2*sqrt(2)/3 = 1.2
eta*kmax_2/3 = eta * nx*pi*(2/3) = .84
12288 with Rl=2000
ke=1.886 eps=3.58 mu = 2e-6
delt = 1e-5 sqrt(ke) delt/delx = .17
Rl = 2034
eta*kmax_spherical = eta * nx*pi*2*sqrt(2)/3 = 1.2
eta*kmax_2/3 = eta * nx*pi*(2/3) = .84
sc4096A: should take 20 days on 16K cores
sc3072A: 8.3 days on 12K cores
sc1024A runs:
eps: 3.58
eta: .000331 eta/delx = .3388
spherical: eta * kmax = eta* nx*pi*2*sqrt(2)/3 = 1.0034
2/3: eta * kmax = eta* nx*pi*(2/3) = .7095
Rl=435
ett=1.05
mu=3.5e-5
maxU = 6.18,6.33,6.04
maxUcfl = 11.77
ke = 1.886 sqrt(ke)=1.37 = maxU/4.5
eps=3.58 k eps-2/3 = .806
delt ~ 1.2e-4
My CFL: (delt=.0001)
11.77 * delt/delx = 1.5 delt< .13/N OR
8.6 sqrt(ke) delt/delx = 1.5 1.9*maxU delt/delx = 1.5
sqrt(ke) delt/delx = .17 maxU delt/delx = .8
Pope: sqrt(k) delt/delx = .05
maxU delt/delx = .22
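Scratch Python check of the sc1024A numbers above against the Rl / eta relations at the top of this entry (values copied from this log; delx assumed to be 1/N, i.e. a unit box, consistent with the CFL numbers above):
from math import sqrt, pi
ke, eps, mu, nx, delt = 1.886, 3.58, 3.5e-5, 1024, 1.2e-4   # sc1024A
Rl  = ke * sqrt(20.0/(3.0*mu*eps))        # ~435
eta = (mu**3/eps)**0.25                   # ~.000331
kmax_sph = nx*pi*2.0*sqrt(2.0)/3.0
print(Rl, eta, eta*kmax_sph)              # ~435, ~.000331, ~1.00
print(sqrt(ke)*delt/(1.0/nx))             # CFL number, ~.17 as noted above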
64^3 forcing12.inp case (ke=1.71)
.005 sqrt(k) delt/delx = .41 blows up 12 timesteps
.0045 blows up t=3
.004 sqrt(k) delt/delx = .333 stable to t=10
eta*kmax=1, eps=3.58
mu = eps**(1/3) * [(eta*kmax)/(2*pi*N*sqrt(2)/3)]**(4/3)
= .36 N**(-4/3) (checked in the sketch after the table)
mu mu (formula)
512 1e-4 .88e-4
768 5.1e-5 cpus: 128, 192, 256, 384, [2,3,4,6,8]x384
1024 3.5e-5 3.5e-5
1536 2.0e-5
2048 1.38e-5
3072 8.06e-6
4096 5.5e-6
12288 1.25e-6 1.3e-6
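The mu(formula) column above is just the expression a few lines up evaluated at each N; minimal Python sketch (assumes spherical-dealiasing kmax = 2*sqrt(2)*pi*N/3, eta*kmax = 1, eps = 3.58; the function name is made up):
from math import pi, sqrt
def mu_for_N(n, eps=3.58, eta_kmax=1.0):
    # mu = eps**(1/3) * (eta*kmax / kmax)**(4/3), kmax = 2*sqrt(2)*pi*n/3
    kmax = 2.0*sqrt(2.0)*pi*n/3.0
    return eps**(1.0/3.0) * (eta_kmax/kmax)**(4.0/3.0)
for n in (512, 1024, 2048, 4096, 12288):
    print(n, mu_for_N(n))   # ~8.8e-5, 3.5e-5, 1.38e-5, 5.5e-6, 1.3e-6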
*******************************************************************************
3/6/2006
Kerr's compression of our 2048^3 data set project.
TODO:
PYTHON: read compress.o
find time in decay.out
print out comparison: scalars.m, 2048^3 read
how to get files off of HPSS and onto lindor?
ccs-2 machine "lindor": /usb1 /usb2 /usb3
psi doesn't support 2GB files
sftp hpss.lanl.gov connection refused
to get files off of HPSS onto QSC:
xpsi get --pfsComp 32 filename
2048^3 = 192 GB snapshots 6min to read
1440^3 = 67 GB snapshots 22min to read/write
1440z = 22 GB snapshots 1min to read/write
2048 -> 1440 grid to spectral truncated 3 arrays = 192 GB
6min to read, 22min to write
1440 -> 1440 spectral to compressed, save stats! 8 arrays = 178 GB
22min to read, 1min to write
1440 stats 8 arrays = 178 GB
1min to read
70 files, 192GB = 13TB
22GB = 1.5TB
list of times:
0000.4020
0000.4026
0000.4188
0000.4328
0000.4603
0000.4894
0000.5551
0000.6034
0000.6491
0000.7019
0000.7536
0000.8149
0000.8545
0000.9040
0000.9512
0001.0017
0001.0511
0001.1038
0001.1598
0001.1959
0001.2500
0001.3081
0001.3457
0001.4034
0001.4434
0001.5038
0001.5439
0001.6124
0001.6586
0001.7075
0001.7569
0001.8033
0001.8484
0001.8988
0001.9493
0001.9955
0002.0416
0002.1140
0002.1581
0002.2066
0002.2609
0002.3157
0002.3700
0002.4247
0002.4519
0002.5090
0002.5500
0002.6100
0002.6700
0002.7021
0002.7621
0002.7922
0002.8522
0002.9181
0002.9511
0003.0145
0003.0812
0003.1481
compress-in: started 5/2 evening
0003.2146
0003.2838
0003.3563
0003.3924
0003.4596
save-in
0003.5279
0003.6027
0003.6781
0003.7527
0003.7900
=============
Test case on D800:
64^3 using "decay.inp" initial condition
temp0001.0000.[uvw]
OS/porting glitches: (D800, LAM MPI)
MPI_REAL8 not defined. put in MPI_REAL8=MPI_DOUBLE_PRECISION
put in MPI_OFFSET for arguments to mpi_file_seek.
bug (fixed): mpi_bcast on character*16 arrays:
length argument to mpi_bcast has to be multiplied by 16,
since each string has 16 characters.
bug (fixed) bottom of subroutine SETUP:
Most of the input is read only by process 0 and then
broadcast to the other processes. But at the end of
SETUP all processes were doing some input file reads,
and for me, the my_id<>0 processes were crashing.
bug (not fixed) SPEC7
in RSSTIO, the call to spec7() hangs. I tracked this down
to the mpi_allreduce of length "m1". Two processes have m1=0,
while the remaining processes have m1=6.
The calls to spec8, spec9 and spec10a also hang, but
I didn't track down what was causing this.
I just commented out all of these calls.
bug (not fixed) FSPASS
same "m1" problem. In the mpi_allreduce, 6 of the 8 processes think
m1=3, but the remaining processes think m1=0.
This problem I tracked down, but it took me quite a lot of
debugging. The problem turns out to be that with 8 processors,
using your c16 data file, some of the processes do not
execute the "j1" loop in FSPASS, and hence they do not
call SPEC23D (which is where m1 seems to be set), and so
when they call SPEC23D_T, they have m1=0 and the all_reduce fails.
Set jdebug=0 on Kerr's advice to avoid all of the above problems.
on 8 cpus, it hangs. On 4 cpus, there is a problem in RSPASS,
the allocate statement, the second dimension, nword3b=0
RSPASS: nword3b=0 disable allocate on Kerr's advice
It seems uw still needs to be allocated, so I only don't
allocate u if nword3b=0.
Then I got a strange error about memory allocation, which went
away if I added ",STAT=istat" to all the allocate statements
(and I check istat after each one to make sure the allocate
worked).
On 1 and 4 cpus the code now runs and produces output,
but the output is different. I've attached:
1.out stdout for 1 cpu run
fsave.1 'fsave' output file
4.out stdout for 4 cpu run
fsave.4 'fsave' output file
on 2 cpus, the code crashes (I haven't tracked this down yet)
and on 8 cpus the code still hangs, but I haven't tracked that
down yet either.
2 cpu code: added another STAT= and fixed the problem.
8 cpu case: problem is m1=3,0 in spec23_d
*******************************************************************************
6/30/2005
rotation case (modeled after leslie smith's run)
to add passive scalar:
./gridsetup.py 1 1 32 128 128 128 2 2 2 0 0 0 4
256x256x32 R0=.88, .48, .16
R0 = (epsilon_f * (2*pi*k_f)**2 )**(1/3) / (.5*fcor) (checked in the sketch at the end of this entry)
Leslie's runs:
fcor =13.8 R0=.88 E saturates
fcor = 76 R0=.16 E monotone increasing. ran for t=170 = 6460 revs.
2 Omega = fcor
new input file:
initial cond: none subtype=0
forcing: iso_high_16
Lz = 1/8
fcor = chosen to achieve the R0 given above
Bous = 0
hyper4 = del**4 = laplacian**2
coefficient = 1 (auto scaled)
My first runs: (dosadi) not saved. used for debugging and tuning:
128x128x32
k_f = 8 fcor=15 mu_hyper=.01, .1, 1.0, 10.0
epsilon_f ~= .5
epsilon = .41
Everything looks good: but no inverse cascade with fcor=15
had to go up to fcor=150 to get an inverse cascade.
New runs:
rotA 128x128x32
k_f=16 fcor=15 mu_hyper=1.0 mu=0
epsilon_f .52
epsilon_ke .52
R0 = 2.32
No inverse cascade.
rotB 128x128x32
k_f=16 fcor=40 mu_hyper=1.0 mu=0
eddy turnover time: .72
epsilon_f .52
epsilon_ke .51
R0 = .87
KE: .1 -> .22 in time=10
1/Omega = time for one revolution:
time * Omega = number of revolutions of run
Grashof number: ||f|| L^(1.5) / nu^2
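Scratch check of the R0 values quoted for rotA/rotB above, using the Rossby number formula near the top of this entry (Python, numbers from this log; the function name is made up):
from math import pi
def rossby(eps_f, k_f, fcor):
    # R0 = (epsilon_f * (2*pi*k_f)**2)**(1/3) / (.5*fcor)
    return (eps_f*(2.0*pi*k_f)**2)**(1.0/3.0) / (0.5*fcor)
print(rossby(0.52, 16, 15.0))   # rotA: ~2.32
print(rossby(0.52, 16, 40.0))   # rotB: ~0.87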
*******************************************************************************
10/3/2004
New Monika VXPAIR case: fcor, Z scale and Bous parameter are in the input file
init_subtype==6 (see cases2v.F90)
delta=.1
viscosity = 1e-7
./gridsetup.py NCPU 1 1 1 4800 2880 1 2 2 0 2 2 0 2
./gridsetup.py NCPU 1 1 1 2400 1440 1 2 2 0 2 2 0 2
KIWI: 4 cpus time(min) per timestep: .468 timestep ~ .0011s
2 cpus .644
qsc: 32 .062
64 .0371 1.09 days
shankara: (some of these timings were not using 2 cpu per node,
I should redo them)
2 .666
4 .391
8: .255
16: .161 (4.7 days to t=50)
24 .109 3.2 days
shankara-lam
4: 1.51
16: .738
delt=.0012
time_per_timestep(m)*50/.0012/60 = time for run in hours
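The day counts quoted above follow from that formula; quick Python version of the same arithmetic (nothing new, function name made up):
def run_days(min_per_step, t_final=50.0, delt=0.0012):
    # timesteps to t_final = t_final/delt; minutes -> hours -> days
    return min_per_step*(t_final/delt)/60.0/24.0
for label, m in (("shankara 16", 0.161), ("shankara 24", 0.109), ("qsc 64", 0.0371)):
    print(label, run_days(m))   # ~4.7, ~3.2, ~1.1 days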
vx4800a delta=.1, viscosity=1e-7 init_subtype=6
4800x2880
yscale=1.8 xlocation=1.5
vx2400a: same input file as above
delta = .1, viscosity=1e-7
ubar = 0.098
jack ran this problem: 2640x1520 domain: [-1.7 , 1.6] x [0 1.9]
I'm running: 2400x1440 domain: [-1.5 , 1.5] x [0 1.8]
IN shankara
NEXT:?
vx4800b NEW RUN with delta=.05, viscosity=1e-7. same ubar? init_subtype?
**********************************************************************************
10/3/2004
Evelyn found a bug during restart, at 2048^2? (or was it 4096^2)
Checking 2048^2 on 4 cpus (kiwi)
./gridsetup.py 16 1 1 4096 2048 1 2 2 0 2 2 0 2
eve1 (done) output: .01, .02
eve2 (done) restart from .01, run to .02
**********************************************************************************
1/30/04
fractional structure functions for Susan Kurien
exponents: -.8, -.6, -.4, -.2, ...
queued up all runs
complete archived cnslgw
2.4 x x x
2.3 x x
2.2 x x x
2.1 x x
2.0 x x
1.9 IN floating point divide by zero?
1.8 x x
1.7 x x
1.6 RUN missing data - don't use
1.5 x
1.4 x
1.3 x x
1.2 x
1.1 x
1.0 x x
**********************************************************************************
12/1/03
cospec results:
cospec from sc1024A: looks like -7/3?
cospec from decay2048: IN
cospec from subcubes:
**********************************************************************************
11/6/03
step 1:
set restart=0
run tmix256D-noscalars.inp used to generate initial condition
run to t=.3
step 2:
set restart=1
leave "name=tmix256D-noscalars", but set "refin=tmix256D-rescale.inp"
run to t=0, just to generate rescaled initial condition
rename the output: tmix256D0000.3000-noscalars-rescale.* tmix256D000.3000.*
step3
set name=tmix256D
rename directory tmix256D-noscalars tmix256D
run restart run, with compute new passive scalars
step4
regular restart (uvw and passive scalars)
tmix256D input file parameters:
init_cond_subtype==3
spectrum peaked at k=6 (instead of 10)
lowered mu from 3e-4 to 1.75e-4 because of change in k_peak.
(kmax*eta at t=0 is 1.5)
Change KE corr. scalar from type 1 to 3: ke_thresh = .5? debug.
add to matlab: Sk_{ln eps_c} = <psi^3>/<psi^2>^{3/2}
K_{ln eps_c} = <psi^4>/<psi^2>^{2}
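Definitions of the two moment ratios above, sketched in Python/NumPy just to pin them down (psi is assumed to be the mean-removed ln(eps_c) field; this is not the actual matlab code):
import numpy as np
def skew_flatness(psi):
    # Sk = <psi^3>/<psi^2>**1.5,  K = <psi^4>/<psi^2>**2
    psi = psi - psi.mean()
    m2 = np.mean(psi**2)
    return np.mean(psi**3)/m2**1.5, np.mean(psi**4)/m2**2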
Check with Ray:
velocity: now peaked at 6, not 10. slope still k**2 for k<6
scalars:
Gaussian scalar: same as before: double delta
KE correlated scalar: distribution chosen so that: c=1 peak 6x larger than
c=0 peak.
need to change subtype and scalar type in new .inp file.
**********************************************************************************
10/24/03
VXPAIR Kras initial condition
init_cond_subtype=100
./gridsetup.py 16 1 1 640 512 1 2 2 0 2 2 0 2
make dnsvor
see RUNME script in parent directory
vx2560a
2560x2048 (400MB) IN ~9 hours to t=5 on 4 cpus milkyway
data stored in ~/data/kras
init_cond_subtype=100
mu=1e-6 should also be able to do 1e-7?
hard coded: (code may change later)
biotsavart_cutoff=5e-3
biotsavart_apply=50
delta=.1
biotsavart_ubar=.100
vx2560b
2560x2048 (400MB) IN l1
data stored in /netscratch/taylorm/kras
init_cond_subtype=100
mu=1e-7 should also be able to do 1e-7?
hard coded: (code may change later)
biotsavart_cutoff=5e-3
biotsavart_apply=50
delta=.1
biotsavart_ubar=.1
vx2560c
2560x2048 (400MB) IN l1