
Orion, Intel environment for spack-stack-1.8.0 breaks the system tar command #1355

Open
srherbener opened this issue Oct 22, 2024 · 29 comments · May be fixed by JCSDA/CRTMv3#196 or #1435
Labels: bug (Something is not working), INFRA (JEDI Infrastructure)

@srherbener
Collaborator

Describe the bug

The Orion, Intel spack-stack-1.8.0 environment, specifically the LD_LIBRARY_PATH setting, interferes with the execution of the system tar command. See the next section on reproducing the error.

Simply running tar outside of ecbuild produces the same failure. After some tracing, it appears that things go awry when the gzip functionality is loaded: the spack-stack /apps/contrib/spack-stack/spack-stack-1.8.0/envs/ue-intel-2021.9.0/install/intel/2021.9.0/libxcrypt-4.4.35-ebrdc3w/lib/libcrypt.so.2 shared library gets loaded instead of the system libcrypt library (/usr/lib64/libcrypt.so.2).

I found that if LD_LIBRARY_PATH is unset (or /usr/lib64 is prepended to it), the tar command works properly.

I need help coming up with a workable fix for this. I've tried:

  • Altering the CRTM test/CMakeLists.txt file to unset LD_LIBRARY_PATH before running tar
    • This involves the CMake execute_process command, and I have probably done something wrong in my attempt; it certainly seems like this should be workable (see the sketch after this list)
  • Defining and exporting a bash function for tar that unsets LD_LIBRARY_PATH and then runs /usr/bin/tar
    • This works when run in the shell, but does not work when run from CMake
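
For the execute_process approach, one pattern that might work is to let CMake's portable "env" launcher drop the variable for just that one child process. A minimal sketch from the shell (the tarball name is illustrative; the same argument list could be passed to execute_process(COMMAND ...)):

# Run the system tar with LD_LIBRARY_PATH removed for only this process,
# using "cmake -E env --unset=VAR <command>".
cmake -E env --unset=LD_LIBRARY_PATH tar -xzf fix_REL-3.1.1.2.tgz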

Does anyone have any ideas about how to address this?

Also, could someone check my environment setup to make sure I'm not missing something?

Thanks!

To Reproduce
Steps to reproduce the behavior:

Load the Intel environment by sourcing a script file that contains the following sequence:

#!/bin/bash

echo "Loading EWOK-SKYLAB Environment Using Spack-Stack 1.8.0"

SPACK_STACK_INTEL_ENV=/apps/contrib/spack-stack/spack-stack-1.8.0/envs/ue-intel-2021.9.0

# load modules
module purge
module use $SPACK_STACK_INTEL_ENV/install/modulefiles/Core
module load stack-intel/2021.9.0
module load stack-intel-oneapi-mpi/2021.9.0
module load stack-python/3.11.7

jedi-host-post-load() {
  module swap git-lfs git-lfs/3.1.2
}

# This is a fix for the issue where the spack-stack-1.8.0 udunits
# module does not get loaded properly. Without this workaround, the
# udunits module from the "spack-managed" area gets loaded instead and
# ecbuild on jedi-bundle fails.
#
# Setting LMOD_TMOD_FIND_FIRST gets rid of the default marking
# of modules, and the modification of MODULEPATH makes sure
# that spack-stack-1.8.0 modules are found before same-named
# modules in other directories (ie, "spack-managed")
export LMOD_TMOD_FIND_FIRST=yes
module use $SPACK_STACK_INTEL_ENV/install/modulefiles/intel/2021.9.0

# Load JEDI modules
module load jedi-fv3-env
module load ewok-env
module load soca-env

The export and module use commands near the end are the workaround to get the proper udunits package to load.

Run ecbuild:

ecbuild -DPython3_EXECUTABLE=$(which python3) $JEDI_SRC

This results in the following error:

-- Building tests for CRTM v3.1.1.
-- Downloading CRTM coeffs files from: https://bin.ssec.wisc.edu/pub/s4/CRTM//fix_REL-3.1.1.2.tgz to /work2/noaa/jcsda/herbener/jedi/build-intel/test_data/3.1.1/fix_REL-3.1.1.2.tgz
-- Checking if /work2/noaa/jcsda/herbener/jedi/build-intel/test_data/3.1.1/fix_REL-3.1.1.2 already exists...
-- Untarring the downloaded file (~2 minutes) to /work2/noaa/jcsda/herbener/jedi/build-intel/test_data/3.1.1
tar: Relink `/apps/spack-managed/gcc-11.3.1/intel-oneapi-compilers-2023.1.0-sb753366rvywq75zeg4ml5k5c72xgj72/compiler/2023.1.0/linux/compiler/lib/intel64_lin/libimf.so' with `/usr/lib64/libm.so.6' for IFUNC symbol `sincosf'
CMake Error at crtm/test/CMakeLists.txt:106 (message):
  Failed to untar the file.


-- Configuring incomplete, errors occurred!

Expected behavior
The tar command run from the CRTM CMake configuration should complete successfully.

System:
Orion, Intel


@climbfuji
Collaborator

This is a known bug in the Intel oneAPI distribution itself. I sent the Intel developers a bug fix for it at the beginning of the calendar year, and I also sent the bug fix to the Orion/Hercules sysadmins. According to @RatkoVasic-NOAA, this problem was fixed for some of the libraries in the oneAPI distribution, but maybe not all?

@srherbener
Collaborator Author

This is a known bug in the Intel oneAPI distribution itself. I sent the Intel developers a bug fix for it at the beginning of the calendar year, and I also sent the bug fix to the Orion/Hercules sysadmins. According to @RatkoVasic-NOAA, this problem was fixed for some of the libraries in the oneAPI distribution, but maybe not all?

Thanks for the response @climbfuji! Very helpful information.

@RatkoVasic-NOAA do you think there might be some libraries in the oneAPI installation that have not been repaired yet? And repairing those might address this issue? Thanks!

@RatkoVasic-NOAA
Collaborator

@srherbener I avoided that error by purging all loaded modules from my environment: for the spack-stack installation (both on Orion and Hercules) I started with 'module purge', and then all errors associated with "Failed to untar the file" disappeared.

@climbfuji
Collaborator

Isn't module purge part of the standard instructions for everyone before loading any spack-stack modules?

Also, I would be surprised if that really solved the problem - but I'd be happy to be surprised, for sure :-)

@srherbener added the INFRA JEDI Infrastructure label Oct 23, 2024
@srherbener
Collaborator Author

@srherbener I avoided that error by purging all loaded modules from my environment: for the spack-stack installation (both on Orion and Hercules) I started with 'module purge', and then all errors associated with "Failed to untar the file" disappeared.

@RatkoVasic-NOAA in the example environment setting (in the description above) I have a call to module purge before the rest of the module load commands. Is this what you are referring to? Or is it something else that I am missing? Thanks!

@RatkoVasic-NOAA
Collaborator

@srherbener here is what happened to me while installing spack-stack: I was getting the same error message as you (I wasn't aware that I hadn't purged modules before installation).
Modules loaded by default were:

  1) contrib/0.1   2) noaatools/3.1   3) intel-oneapi-compilers/2023.1.0

Then I purged the modules and the error messages disappeared.
I thought that the sysadmins had fixed the problematic libraries and that it was working because of that, but I realized that most likely it was because I hadn't purged modules before installing spack-stack on Orion (and Hercules).
I hope I managed to explain my train of thought and the chain of events. :-)

@climbfuji
Collaborator

So that means the sysadmins didn't fix anything yet, you just unloaded the modules when you had a problem. I've seen in the past that some applications don't show this problem, while others do. On discover, for example, fv3-jedi would run fine, but geos-jedi failed with the above error.

Someone other than a weird dude from a different agency with no purpose on orion/hercules should be making a lot of noise all the way up the hierarchy until the sysadmins fix this.

@srherbener
Collaborator Author

Right after logging into orion, I see this:

orion-login-2[5] herbener$ module list
No modules loaded
orion-login-2[7] herbener$ echo $MODULEPATH
/apps/spack-managed/modulefiles/linux-rocky9-x86_64/Core:/apps/other/modulefiles:/apps/containers/modulefiles:/apps/licensed/modulefiles

Then I source our JCSDA, JEDI orion, intel environment script, which does a module purge first before any module load commands. Now I see:

(venv-intel) orion-login-2[11] herbener$ module list

Currently Loaded Modules:
  1) intel-oneapi-compilers/2023.1.0  73) py-h5py/3.11.0
  2) stack-intel/2021.9.0             74) py-cftime/1.0.3.4
  3) intel-oneapi-mpi/2021.9.0        75) py-netcdf4/1.5.8
  4) stack-intel-oneapi-mpi/2021.9.0  76) py-bottleneck/1.3.7
  5) gettext/0.21                     77) py-numexpr/2.8.4
  6) glibc/2.34                       78) py-et-xmlfile/1.0.1
  7) libxcrypt/4.4.35                 79) py-openpyxl/3.1.2
  8) zlib-ng/2.1.6                    80) py-six/1.16.0
  9) sqlite/3.43.2                    81) py-python-dateutil/2.8.2
 10) util-linux-uuid/2.38.1           82) py-pytz/2023.3
 11) python/3.11.7                    83) py-pyxlsb/1.0.10
 12) stack-python/3.11.7              84) py-xlrd/2.0.1
 13) snappy/1.1.10                    85) py-xlsxwriter/3.1.7
 14) zstd/1.5.2                       86) py-xlwt/1.3.0
 15) c-blosc/1.21.5                   87) py-pandas/1.5.3
 16) curl/7.76.1                      88) py-pycodestyle/2.11.0
 17) hdf5/1.14.3                      89) py-pyhdf/0.10.4
 18) netcdf-c/4.9.2                   90) libyaml/0.2.5
 19) netcdf-fortran/4.6.1             91) py-pyyaml/6.0.1
 20) fms/2024.02                      92) py-scipy/1.12.0
 21) cmake/3.27.9                     93) py-packaging/23.1
 22) git/2.31.1                       94) py-xarray/2023.7.0
 23) nccmp/1.9.0.1                    95) jedi-base-env/1.0.0
 24) parallel-netcdf/1.12.3           96) jedi-fv3-env/1.0.0
 25) parallelio/2.6.2                 97) py-awscrt/0.16.16
 26) python-venv/1.0                  98) py-colorama/0.4.6
 27) py-pip/23.1.2                    99) py-cryptography/38.0.1
 28) wget/1.21.3                     100) py-distro/1.8.0
 29) base-env/1.0.0                  101) py-docutils/0.19
 30) boost/1.84.0                    102) py-jmespath/1.0.1
 31) openblas/0.3.24                 103) py-wcwidth/0.2.7
 32) py-setuptools/63.4.3            104) py-prompt-toolkit/3.0.38
 33) py-numpy/1.23.5                 105) py-ruamel-yaml/0.17.16
 34) bufr/12.1.0                     106) py-ruamel-yaml-clib/0.2.7
 35) eigen/3.4.0                     107) py-urllib3/1.26.12
 36) eckit/1.27.0                    108) awscli-v2/2.13.22
 37) gsl-lite/0.37.0                 109) ecflow/5.11.4
 38) netcdf-cxx4/4.3.1               110) py-botocore/1.34.44
 39) py-pybind11/2.11.0              111) py-s3transfer/0.10.0
 40) bufr-query/0.0.2                112) py-boto3/1.34.44
 41) ecbuild/3.7.2                   113) py-contourpy/1.0.7
 42) libpng/1.6.37                   114) py-cycler/0.11.0
 43) openjpeg/2.3.1                  115) py-fonttools/4.39.4
 44) eccodes/2.33.0                  116) py-kiwisolver/1.4.5
 45) fftw/3.3.10                     117) py-pillow/9.5.0
 46) fckit/0.11.0                    118) py-pyparsing/3.1.2
 47) fiat/1.2.0                      119) py-matplotlib/3.7.4
 48) ectrans/1.2.0                   120) proj/9.2.1
 49) qhull/2020.2                    121) py-certifi/2023.7.22
 50) atlas/0.38.1                    122) py-pyproj/3.6.0
 51) sp/2.5.0                        123) py-pyshp/2.3.1
 52) gsibec/1.2.1                    124) geos/3.12.1
 53) libjpeg/2.1.0                   125) py-shapely/1.8.0
 54) krb5/1.21.2                     126) py-cartopy/0.23.0
 55) libtirpc/1.3.3                  127) py-smmap/5.0.0
 56) hdf/4.2.15                      128) py-gitdb/4.0.9
 57) jedi-cmake/1.4.0                129) py-gitpython/3.1.40
 58) libxt/1.3.0                     130) py-click/8.1.7
 59) libxmu/1.1.4                    131) py-pyjwt/2.4.0
 60) libxpm/3.5.17                   132) py-charset-normalizer/3.3.0
 61) libxaw/1.0.15                   133) py-idna/3.4
 62) udunits/2.2.28                  134) py-requests/2.31.0
 63) ncview/2.1.9                    135) py-globus-sdk/3.25.0
 64) json/3.11.2                     136) py-globus-cli/3.16.0
 65) json-schema-validator/2.3.0     137) py-markupsafe/2.1.3
 66) odc/1.5.2                       138) py-jinja2/3.1.2
 67) py-attrs/21.4.0                 139) ewok-env/1.0.0
 68) py-pycparser/2.21               140) antlr/2.7.7
 69) py-cffi/1.15.1                  141) gsl/2.7.1
 70) py-findlibs/0.0.2               142) nco/5.1.6
 71) py-eccodes/1.5.0                143) soca-env/1.0.0
 72) py-f90nml/1.4.3                 144) git-lfs/3.1.2

 

echo $MODULEPATH
/apps/contrib/spack-stack/spack-stack-1.8.0/envs/ue-intel-2021.9.0/install/modulefiles/intel/2021.9.0:/apps/contrib/spack-stack/spack-stack-1.8.0/envs/ue-intel-2021.9.0/install/modulefiles/intel-oneapi-mpi/2021.9.0-p2ray63/intel/2021.9.0:/apps/spack-managed/modulefiles/linux-rocky9-x86_64/intel-oneapi-mpi/2021.9.0-a66eaip/oneapi/2023.1.0:/apps/contrib/spack-stack/spack-stack-1.8.0/envs/ue-intel-2021.9.0/install/modulefiles/gcc/12.2.0:/apps/spack-managed/modulefiles/linux-rocky9-x86_64/oneapi/2023.1.0:/apps/contrib/spack-stack/spack-stack-1.8.0/envs/ue-intel-2021.9.0/install/modulefiles/Core:/apps/spack-managed/modulefiles/linux-rocky9-x86_64/Core:/apps/other/modulefiles:/apps/containers/modulefiles:/apps/licensed/modulefiles

Then I try tar:

(venv-intel) orion-login-2[13] herbener$ tar tzfv build-intel/test_data/3.1.1/fix_REL-3.1.1.2.tgz 
tar: Relink `/apps/spack-managed/gcc-11.3.1/intel-oneapi-compilers-2023.1.0-sb753366rvywq75zeg4ml5k5c72xgj72/compiler/2023.1.0/linux/compiler/lib/intel64_lin/libimf.so' with `/usr/lib64/libm.so.6' for IFUNC symbol `sincosf'
Segmentation fault (core dumped)

which breaks. If I wipe out LD_LIBRARY_PATH, then the tar command works:

(venv-intel) orion-login-2[14] herbener$ LD_LIBRARY_PATH="" tar tzfv build-intel/test_data/3.1.1/fix_REL-3.1.1.2.tgz 
drwxr-xr-x bjohnson/domain users 0 2024-08-14 10:53 fix_REL-3.1.1.2/
drwxr-xr-x bjohnson/domain users 0 2024-08-29 14:45 fix_REL-3.1.1.2/fix/
drwxr-xr-x bjohnson/domain users 0 2024-02-26 10:05 fix_REL-3.1.1.2/fix/EmisCoeff/
drwxr-xr-x bjohnson/domain users 0 2024-02-26 10:05 fix_REL-3.1.1.2/fix/EmisCoeff/IR_Ice/
drwxr-xr-x bjohnson/domain users 0 2024-02-26 10:05 fix_REL-3.1.1.2/fix/EmisCoeff/IR_Ice/SEcategory/
drwxr-xr-x bjohnson/domain users 0 2024-02-26 10:05 fix_REL-3.1.1.2/fix/EmisCoeff/IR_Ice/SEcategory/netCDF/
...

The module purge approach does not appear to help with this issue.

After some debugging, I discovered that the LD_LIBRARY_PATH we set via the module loads (see the initial description above) places the spack-stack libxcrypt path in front of the system path. So when tar executes, the wrong libcrypt library (i.e., the spack-stack one) gets loaded instead of the correct, system-provided libcrypt library. Unfortunately, we need LD_LIBRARY_PATH to be set in the order we are getting so that the jedi-bundle build and tests all work correctly.
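
A quick way to confirm which copy the loader will pick under the current environment is to run ldd on the system tar (ldd honors LD_LIBRARY_PATH); a sketch:

# With the spack-stack environment loaded, the libcrypt line should point at
# the spack-stack install tree; with LD_LIBRARY_PATH cleared it should point
# at /usr/lib64.
ldd /usr/bin/tar | grep libcrypt
LD_LIBRARY_PATH="" ldd /usr/bin/tar | grep libcrypt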

@RatkoVasic-NOAA
Collaborator

I see. How about prepending LD_LIBRARY_PATH with the system path to libcrypt in the modulefile? Then the loader will find that one first and use it instead of spack-stack's.
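
Something along these lines, as an illustrative sketch (not a tested modulefile change):

# Put the system library directory ahead of the spack-stack entries so the
# loader resolves libcrypt from /usr/lib64 first.
export LD_LIBRARY_PATH=/usr/lib64:${LD_LIBRARY_PATH}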

@climbfuji
Collaborator

The underlying problem however is this:

tar: Relink `/apps/spack-managed/gcc-11.3.1/intel-oneapi-compilers-2023.1.0-sb753366rvywq75zeg4ml5k5c72xgj72/compiler/2023.1.0/linux/compiler/lib/intel64_lin/libimf.so' with `/usr/lib64/libm.so.6' for IFUNC symbol `sincosf'
Segmentation fault (core dumped)

It only shows up with libcrypt because the spack-stack libcrypt ldd's to libimf.so, which has the bug I described above.

@eap self-assigned this Oct 28, 2024
@srherbener
Collaborator Author

I looked into this further and discovered that libimf.so does not appear to have the fault (a shared library that is wrongly marked as a static library) that @climbfuji reported. Running ldd on libimf.so indicates a shared library:

(venv-intel) orion-login-4[138] herbener$ ldd /apps/spack-managed/gcc-11.3.1/intel-oneapi-compilers-2023.1.0-sb753366rvywq75zeg4ml5k5c72xgj72/compiler/2023.1.0/linux/compiler/lib/intel64_lin/libimf.so 
ldd: warning: you do not have execution permission for `/apps/spack-managed/gcc-11.3.1/intel-oneapi-compilers-2023.1.0-sb753366rvywq75zeg4ml5k5c72xgj72/compiler/2023.1.0/linux/compiler/lib/intel64_lin/libimf.so'
	linux-vdso.so.1 (0x00007ffd15ffb000)
	libintlc.so.5 => /apps/spack-managed/gcc-11.3.1/intel-oneapi-compilers-2023.1.0-sb753366rvywq75zeg4ml5k5c72xgj72/compiler/2023.1.0/linux/compiler/lib/intel64_lin/libintlc.so.5 (0x00007fa7d6e09000)
	libc.so.6 => /usr/lib64/libc.so.6 (0x00007fa7d6c00000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fa7d728c000)
(venv-intel) orion-login-4[140] herbener$

It might be the case that the libimf.so library has not had libm.so properly linked in, but libimf.so appears to be correctly marked as a shared library, and running nm on the libimf.so file shows this:

(venv-intel) orion-login-4[144] herbener$ nm /apps/spack-managed/gcc-11.3.1/intel-oneapi-compilers-2023.1.0-sb753366rvywq75zeg4ml5k5c72xgj72/compiler/2023.1.0/linux/compiler/lib/intel64_lin/libimf.so | grep sincosf
0000000000244b20 T __bwr_sincosf
0000000000233b10 t __libm_sincosf
00000000003a9100 d __libm_sincosf_chosen_core_func_x
00000000003a4900 d __libm_sincosf_dispatch_table_x
0000000000233aa0 t __libm_sincosf_dispatch_table_x_init
00000000002af000 t __libm_sincosf_e7
00000000002aede0 T __libm_sincosf_ex
00000000002990e0 t __libm_sincosf_huge
00000000002f7650 T __libm_sincosf_rf
000000000029b320 T __libm_sse2_sincosf
0000000000233b20 T sincosf
00000000003107c0 T sincosf16

which shows that the sincosf symbol is actually defined (contrary to the tar error message).

However, I did find that the libirc.so file does indeed have the problem @climbfuji originally reported (libirc.so is involved in the error @climbfuji reported long ago to Intel).

(venv-intel) orion-login-4[140] herbener$ ldd /apps/spack-managed/gcc-11.3.1/intel-oneapi-compilers-2023.1.0-sb753366rvywq75zeg4ml5k5c72xgj72/compiler/2023.1.0/linux/compiler/lib/intel64/libirc.so
ldd: warning: you do not have execution permission for `/apps/spack-managed/gcc-11.3.1/intel-oneapi-compilers-2023.1.0-sb753366rvywq75zeg4ml5k5c72xgj72/compiler/2023.1.0/linux/compiler/lib/intel64/libirc.so'
	statically linked
(venv-intel) orion-login-4[141] herbener$

It appears that loading the spack-stack-built libcrypt.so.2 library instead of the system libcrypt.so.2 library introduced the Intel oneAPI libraries into the mix and somehow got the dynamic loader confused.

At this point, I think the pragmatic path forward is to stop investing time in a fix now and defer this issue to spack-stack-1.9.0, which gives us more time to resolve it. In that spirit, I have added post-release notes for spack-stack-1.8.0 on the spack-stack wiki describing a manual workaround for this issue (https://github.com/JCSDA/spack-stack/wiki/Post%E2%80%90release-updates-for-spack%E2%80%90stack%E2%80%901.8.0), which can hold us over until 1.9.0.

@srherbener
Collaborator Author

I think it is still worthwhile to submit a Priority support ticket to intel about the libirc.so issue. Someone with a NOAA email can submit a Priority support ticket here: https://supporttickets.intel.com/. Unfortunately, my NOAA email has been deactivated and my UCAR email does not grant me access to a Priority support request.

Could someone at NOAA please submit the Priority support ticket on my behalf? Thanks!

Here is text that we could place in the ticket:


We have Intel oneAPI version 2023.1.0 installed on one of our HPC platforms and we are running into trouble using the oneAPI-provided libirc.so library. This library appears to be intended to be loaded as a shared library, but the dynamic loader thinks it is a static library and fails to load the dependencies of libirc.so, ultimately causing undefined reference errors.

Running ldd on the installed libirc.so reveals that this file is understood to be a static library:

herbener$ ldd /apps/spack-managed/gcc-11.3.1/intel-oneapi-compilers-2023.1.0-sb753366rvywq75zeg4ml5k5c72xgj72/compiler/2023.1.0/linux/compiler/lib/intel64/libirc.so
ldd: warning: you do not have execution permission for `/apps/spack-managed/gcc-11.3.1/intel-oneapi-compilers-2023.1.0-sb753366rvywq75zeg4ml5k5c72xgj72/compiler/2023.1.0/linux/compiler/lib/intel64/libirc.so'
statically linked

However, running the file command on the installed libirc.so file indicates the file is a shared library:

herbener$ file /apps/spack-managed/gcc-11.3.1/intel-oneapi-compilers-2023.1.0-sb753366rvywq75zeg4ml5k5c72xgj72/compiler/2023.1.0/linux/compiler/lib/intel64/libirc.so
/apps/spack-managed/gcc-11.3.1/intel-oneapi-compilers-2023.1.0-sb753366rvywq75zeg4ml5k5c72xgj72/compiler/2023.1.0/linux/compiler/lib/intel64/libirc.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, not stripped

Running the following patchelf command is known to fix this issue,

patchelf --add-needed libc.so.6 path-to-libirc.so-file

but we would like to not have to negotiate with our HPC provider IT group to implement this workaround.

Can we please get this addressed in the oneAPI installation so that the libirc.so file is properly understood by the dynamic loader to be a shared library?
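
For reference, the effect of the patchelf workaround can be checked by looking at the DT_NEEDED entries before and after (a sketch; same libirc.so path as above):

# Before the fix there are no NEEDED entries, which is why ldd reports
# "statically linked"; after "patchelf --add-needed libc.so.6" a NEEDED entry
# for libc.so.6 shows up.
readelf -d path-to-libirc.so-file | grep NEEDED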

@srherbener
Collaborator Author

The Intel Priority Support ticket number is:

Intel Developer Products Support - Update to Service Request#:06425926

@srherbener
Collaborator Author

This ticket has been closed with the explanation that libirc.so is marked static intentionally, and we need to figure out how to use libintlc.so in its place. Not very helpful, but at least we know a little more about this issue.

@srherbener
Collaborator Author

After testing PR #1435 the tar issue reported here persists. After some discussion we determined that this issue with tar is unrelated to the libirc.so issue (where libirc.so is shipped as a static library instead of a shared library).

I used the LD_DEBUG=libs setting, which shows a trace of the specific shared libraries loaded as an executable runs. When I log into Orion, before sourcing the spack-stack environment, the tar command works; after I load the spack-stack environment, the tar command crashes. I compared the library loading sequence of tar when it works versus when it doesn't.
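
The comparison can be reproduced roughly like this (a sketch; the log file names are illustrative):

# Capture the dynamic loader trace for the failing and the working case, then
# strip the PID prefix from each line so the two traces can be diffed.
LD_DEBUG=libs tar tzf build-intel/test_data/3.1.1/fix_REL-3.1.1.2.tgz > /dev/null 2> tar-bad.log
LD_LIBRARY_PATH="" LD_DEBUG=libs tar tzf build-intel/test_data/3.1.1/fix_REL-3.1.1.2.tgz > /dev/null 2> tar-good.log
diff <(sed 's/^ *[0-9]*:[ \t]*//' tar-good.log) <(sed 's/^ *[0-9]*:[ \t]*//' tar-bad.log)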

The working tar loads all of its libraries from the system /usr/lib64 location. The crashing tar starts out the same but diverges when it gets to libcrypto.so.3. The comparison of the sequences looks like:

both tar cases:

/usr/lib64/libacl.so.1
...
/usr/lib64/libcrypt.so.2

The good tar then loads

/usr/lib64/libcrypto.so.3

whereas the bad tar loads

<spack-install-path>/intel/2021.9.0/openssl-3.4.0-<hash>/lib64/libcrypto.so.3

Then both tars continue with the same sequence using:

/usr/lib64/libp11-kit.so.0
/usr/lib64/libgcc_s.so.1

Then they diverge again at libz.so.1. The good tar loads

/usr/lib64/libz.so.1

whereas the bad tar loads

<spack-stack-install>/intel/2021.9.0/zlib-ng-2.2.1-<hash>/lib/libz.so.1

Then the two tars finish up with loading

/usr/lib64/libffi.so.8

So it appears that there are issues with where libcrypto.so.3 and libz.so.1 are loaded from.

In the bad environment, I tried module unload openssl/3.4.0 and module unload zlib-ng/2.2.1 to see if I could get the bad tar to run.

  • Unloading either one of openssl or zlib-ng was not enough to prevent the crash
  • However, unloading both enabled the bad tar to run successfully

This seems to indicate that in order to get the Orion environment working, we would need to use external libraries for both

  • libcrypto
  • libz

Is this a viable approach, or will this mess up other parts of the stack?

We discussed preferences for the solution to this issue ordered highest to lowest as:

  1. Replace the offending libraries in spack-stack with the system libraries
  2. Replace the offending libraries in spack-stack with the /apps/spack-managed libraries
    • It looks like, with this approach, we would have to replace the whole openssl package, since the libcrypto library is only available through openssl
    • This approach might have the same issue with tar
  3. Replace the system tar call in the CRTM cmake config with the cmake builtin tar function

I think option 1 still looks good as long as it's okay to replace both libcrypto and libz.

If option 1 doesn't look good, I'm not sure if option 2 is any better and perhaps we should go with option 3.

What do others think?

@climbfuji
Collaborator

I support using the system libz instead of a spack-built zlib-ng. I recently made a similar change for the NRL systems (see recently merged PR). As for libcrypto, this might be a problem with the site config itself. See the open draft PR that tried to address the tar issue (I am removing some external openssl and curl there, I think). Maybe that is sufficient to address the inconsistencies w.r.t. libcrypto?

@srherbener
Collaborator Author

The issue seems to be that spack-stack has built the exact same versions of libcrypto and libz as the system-installed versions. libcrypto is built as part of the openssl package, and libz is essentially the zlib-ng package. So wouldn't the fix be to force spack-stack to use /usr/lib64/libcrypto.so.3 when building the openssl package, and to remove the zlib-ng package altogether and use /usr/lib64/libz.so.1 instead?

@RatkoVasic-NOAA
Collaborator

@srherbener for option one, are you thinking of manually copying the libcrypto.so.3 and libz.so.1 libraries from the system into spack-stack?
Should we do the same thing for [email protected] as well?

@srherbener
Collaborator Author

@srherbener for option one, are you thinking of manually copying the libcrypto.so.3 and libz.so.1 libraries from the system into spack-stack? Should we do the same thing for [email protected] as well?

No, instead I'm thinking of putting configuration like this into the Orion Intel site packages config:

    libcrypto:
      externals:
      - spec: [email protected]
        prefix: /usr
      buildable: false
    libz:
      externals:
      - spec: [email protected]
        prefix: /usr
      buildable: false

which, I'm thinking, should force spack-stack to use the system libcrypto and libz.

I'm trying this out right now on orion to see if concretize works the way I want. I might need to modify what I've posted above, but this is the idea.
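
A sketch of the check I have in mind (run from the spack-stack environment directory; the grep just highlights which openssl/zlib providers the concretizer picked):

# Re-concretize the environment and inspect the chosen openssl/zlib specs.
spack concretize --force 2>&1 | tee concretize.log
grep -iE 'openssl|zlib' concretize.log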

@climbfuji
Collaborator

Please also (or first) try without adding the external libcrypto, but with the bug fixes (site config update for Orion) from #1435

@srherbener
Collaborator Author

Please also (or first) try without adding the external libcrypto, but with the bug fixes (site config update for Orion) from #1435

I'll switch to base my feature branch on your feature branch so I pick up your changes in #1435.

@srherbener
Collaborator Author

@climbfuji, I'm going to try the changes in #1435 along with using the external libz instead of the zlib-ng package first and see if that works. I'm using the recent config update for the NRL platforms that replaces zlib-ng with the system libz as a guide for doing the same on Orion in my feature branch. If this works, then I can submit a PR to your feature branch. Does that sound good?

@climbfuji
Collaborator

Yes, that sounds great, thanks very much!

@srherbener
Collaborator Author

The external libz (using the system libz instead of the spack-stack-built zlib-ng) did not fix the tar issue. I'm going to try forcing an external libcrypto next (along with the external libz).

@srherbener
Collaborator Author

Forcing openssl to use an external libcrypto is proving to be difficult; libcrypto is too integral a part of openssl, so the build configuration insists on building it. I think the loading of the wrong libcrypto by tar on Orion is a somewhat rare situation: the system libcrypto that got linked into tar is libcrypto.so.3, and this happens to match one of the links in the spack-stack openssl build. Using file names with the whole version number would have helped prevent the collision.

It looks like the openssl in the /apps/spack-managed area on Orion is a bit older (version 1.1.1s) than the spack-built openssl (version 3.4.0), but I'll give that a try next. That is, force openssl to be external and pick it up from the /apps/spack-managed area. This should remove the collision (since the [email protected] version provides [email protected]) and allow tar to run successfully.

@srherbener
Collaborator Author

Looks like krb5 needs a newer version of openssl than the one available in the /apps/spack-managed area (which is [email protected]). When using [email protected], the krb5 build gets a lot of undefined reference errors that appear related to openssl:

 >> 7028    /work2/noaa/jcsda/herbener/projects/spack-stack/cache/build_stage/spack-stage-krb5-1.21.3-fm7i6tn75e4zrkkyznbtgkzskjkqfkbu/spack-src/src/plugins/preauth/pkinit/pkinit_crypto_openssl.c:2241: undefined reference to `OPENSSL_sk_free'
  >> 7029    ld: /work2/noaa/jcsda/herbener/projects/spack-stack/cache/build_stage/spack-stage-krb5-1.21.3-fm7i6tn75e4zrkkyznbtgkzskjkqfkbu/spack-src/src/plugins/preauth/pkinit/pkinit_crypto_openssl.c:2197: undefined reference to `OPENSSL_sk_new_null'
  >> 7030    ld: /work2/noaa/jcsda/herbener/projects/spack-stack/cache/build_stage/spack-stage-krb5-1.21.3-fm7i6tn75e4zrkkyznbtgkzskjkqfkbu/spack-src/src/plugins/preauth/pkinit/pkinit_crypto_openssl.c:2198: undefined reference to `OPENSSL_sk_push'
     7031    ld: pkinit_crypto_openssl.so: in function `cms_signeddata_create':
  >> 7032    /work2/noaa/jcsda/herbener/projects/spack-stack/cache/build_stage/spack-stage-krb5-1.21.3-fm7i6tn75e4zrkkyznbtgkzskjkqfkbu/spack-src/src/plugins/preauth/pkinit/pkinit_crypto_openssl.c:1558: undefined reference to `OPENSSL_sk_new_null'
  >> 7033    ld: /work2/noaa/jcsda/herbener/projects/spack-stack/cache/build_stage/spack-stage-krb5-1.21.3-fm7i6tn75e4zrkkyznbtgkzskjkqfkbu/spack-src/src/plugins/preauth/pkinit/pkinit_crypto_openssl.c:1561: undefined reference to `OPENSSL_sk_value'
  >> 7034    ld: /work2/noaa/jcsda/herbener/projects/spack-stack/cache/build_stage/spack-stage-krb5-1.21.3-fm7i6tn75e4zrkkyznbtgkzskjkqfkbu/spack-src/src/plugins/preauth/pkinit/pkinit_crypto_openssl.c:1573: undefined reference to `X509_STORE_CTX_set0_trusted_stack'
...

I'm a little bit concerned that trying to use external packages to resolve this issue is going to start unravelling into a major effort.

I already have a solution with changing the CRTM CMake configuration to use the CMake builtin tar function (which has been deemed a lesser desired solution, but does work). The builtin tar feature was introduced in [email protected] and jedi-bundle requires [email protected] or newer, so the jedi-bundle application is good with this solution. However, I realize that this might not be the case with other applications.

I'm inclined to pursue the CMake builtin tar feature as the solution for this issue, but first it would be good to better understand the downside of this solution.

  • One question is how many apps are impacted by this fault, and whether these apps can also employ the CMake builtin tar as a solution?
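
As a quick command-line check of CMake's builtin extraction (a sketch; the in-script form would be file(ARCHIVE_EXTRACT), and the tarball name is illustrative):

# Use CMake's bundled tar implementation instead of calling the system tar.
cmake -E tar xzf fix_REL-3.1.1.2.tgz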

Thanks for your help with this!

@srherbener
Collaborator Author

Just had a thought: is there a way to tell the openssl package to use a file name other than libcrypto.so.3 (which collides with the system /usr/lib64/libcrypto.so.3 file) for its built-in libcrypto library, e.g. libcrypto_ssl.so.3? That should solve the issue on Orion without any application changes. Would a solution like this be preferred?

@climbfuji
Collaborator

I don't know. But I believe that a lot of problems that we are seeing are due to the fact that we set LD_LIBRARY_PATH in our modules. According to the spack developers, that shouldn't be necessary, because spack uses rpath for all its builds.

@srherbener
Collaborator Author

Setting LD_LIBRARY_PATH in our spack-stack modules is certainly contributing to this issue. If we didn't set LD_LIBRARY_PATH, I would expect the system tar command to work just fine. Another contributor is that the system tar command honors LD_LIBRARY_PATH when resolving its libraries, though the Orion IT folks might want it that way to support relocatable libraries.
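
One way to check that for any given spack-built library is to look for an RPATH/RUNPATH entry in its dynamic section (a sketch, using the libcrypt library from the original report):

# A spack-built library should carry its dependency directories in
# DT_RPATH/DT_RUNPATH, so it would not need LD_LIBRARY_PATH at run time.
readelf -d /apps/contrib/spack-stack/spack-stack-1.8.0/envs/ue-intel-2021.9.0/install/intel/2021.9.0/libxcrypt-4.4.35-ebrdc3w/lib/libcrypt.so.2 | grep -E 'RPATH|RUNPATH'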
