HPC Troubleshooting Guide

This guide covers the most common environment and installation issues on HPC clusters (HiPerGator). For general installation instructions, see installation.md.

Issue 1: Packages Show [MISSING] Despite Being Installed

Symptoms:

[MISSING]  scikit-learn
           /lib64/libstdc++.so.6: version `CXXABI_1.3.15' not found
[MISSING]  matplotlib
           /lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found
[ERROR]    pyvips
           cannot load library: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found
[MISSING]  stackview

Root cause: The system’s /lib64/libstdc++.so.6 is too old and gets loaded before conda’s newer version. The packages are installed but fail to import because their compiled extensions need newer C++ ABI symbols.
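To confirm this diagnosis, it can help to compare the highest ABI tags exported by the system library and by conda's copy. The sketch below is one way to do that, assuming the version tags appear as plain ASCII strings inside the shared object (which is how GNU libstdc++ stores them). The function names are illustrative and are not part of KINTSUGI.

```python
import re
from pathlib import Path

# Symbol-version tags such as GLIBCXX_3.4.30 are stored as ASCII strings in the library.
_TAG = re.compile(rb"(GLIBCXX|CXXABI)_(\d+(?:\.\d+)*)")


def abi_versions(data: bytes) -> dict[str, str]:
    """Return the highest GLIBCXX and CXXABI version tags found in raw library bytes."""
    highest: dict[str, tuple[int, ...]] = {}
    for family, version in _TAG.findall(data):
        key = family.decode()
        parsed = tuple(int(part) for part in version.decode().split("."))
        # Compare numerically so that 3.4.30 ranks above 3.4.9.
        if parsed > highest.get(key, ()):
            highest[key] = parsed
    return {name: ".".join(map(str, ver)) for name, ver in highest.items()}


def abi_versions_of(path: str) -> dict[str, str]:
    """Hypothetical convenience wrapper: read a shared library from disk and scan it."""
    return abi_versions(Path(path).read_bytes())
```

Calling abi_versions_of("/lib64/libstdc++.so.6") and then the same function on the libstdc++.so.6 under $CONDA_PREFIX/lib shows whether the system copy stops short of the GLIBCXX_3.4.30 and CXXABI_1.3.15 tags named in the errors above.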

Fix:

conda activate KINTSUGI
kintsugi fix-hpc
conda deactivate && conda activate KINTSUGI
kintsugi check

kintsugi fix-hpc deploys conda activation scripts that prepend $CONDA_PREFIX/lib to LD_LIBRARY_PATH, ensuring conda’s libstdc++.so.6 loads first.
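After re-activating the environment, the search order can be checked directly. This is a minimal sketch that only inspects environment variables. It does not reproduce what kintsugi fix-hpc installs, and the function name is hypothetical.

```python
import os
from typing import Mapping


def conda_lib_first(env: Mapping[str, str]) -> bool:
    """True if $CONDA_PREFIX/lib is the first entry of LD_LIBRARY_PATH."""
    prefix = env.get("CONDA_PREFIX")
    if not prefix:
        return False  # no conda environment is active
    expected = os.path.normpath(os.path.join(prefix, "lib"))
    # Ignore empty entries produced by leading, trailing, or doubled colons.
    entries = [entry for entry in env.get("LD_LIBRARY_PATH", "").split(":") if entry]
    return bool(entries) and os.path.normpath(entries[0]) == expected
```

Running conda_lib_first(os.environ) inside the activated environment should return True once the activation scripts are in place. A False result suggests the scripts were not deployed or the environment was not re-activated.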

Prevention: Use ./scripts/install.sh --hpc or envs/env-hpc.yml, which pins libstdcxx-ng>=12.0 and deploys the activation scripts automatically.

Issue 2: kintsugi install all Completes But Packages Are Missing

Symptoms: kintsugi install all prints “Installation complete!” but kintsugi check shows [SKIP] for scanpy, phenograph, instanseg, etc.

Root cause: In pip mode, install all runs a single pip install -e ".[all_extras]", which can silently drop packages that fail to resolve. The pip-only install path also skips the conda commands needed for the CUDA runtime libraries.

Fix (for future installs): On HPC, run install all --conda or, preferably, create the environment from envs/env-hpc.yml:

# Option A: Conda mode (runs conda commands for GPU groups)
kintsugi install all --conda

# Option B: Start fresh with HPC env file (recommended)
conda env remove -n KINTSUGI -y
conda env create -f envs/env-hpc.yml
conda activate KINTSUGI
kintsugi fix-hpc
conda deactivate && conda activate KINTSUGI

Fix (for missing packages in existing env): Install individually to see specific errors:

pip install scanpy
pip install phenograph
pip install instanseg instanseg-torch
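Before reinstalling one package at a time, it may be useful to list which modules are actually absent from the environment. The sketch below uses the standard library's importlib.util.find_spec. It only locates modules without importing them, so it will not detect the libstdc++ failure from Issue 1. The helper name is hypothetical, and the import names in the usage note are assumed to match the package names.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level import names that cannot be located in this environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in module_names if find_spec(name) is None]
```

For example, missing_modules(["scanpy", "phenograph", "instanseg"]) returns the subset that still needs to be installed. Each of those can then be installed on its own so that pip reports a specific error.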

Prevention: The install all command now auto-detects HPC environments and switches to conda mode, which installs conda groups first and then pip-only packages individually (avoiding the mega-extras resolution that silently drops packages).

Issue 3: CuPy Imports But CUDA Operations Fail

Symptoms: import cupy succeeds but cupy.array([1.0]) fails with:

ImportError: libcufft.so.10: cannot open shared object file

Root cause: cupy-cuda12x (pip) only provides Python bindings. The actual CUDA runtime libraries (libcufft, libcublas, libcusolver) must be installed separately via conda.
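A quick way to see which runtime libraries are absent is to look for their shared objects in the environment's lib directory. This sketch assumes the conda packages place, or symlink, the libraries under $CONDA_PREFIX/lib. The function name and the default list of library stems are illustrative.

```python
from pathlib import Path


def missing_cuda_libs(lib_dir, stems=("libcufft", "libcublas", "libcusolver")):
    """Return the library stems that have no matching shared object in lib_dir."""
    root = Path(lib_dir)
    # Match any soname version, e.g. libcufft.so, libcufft.so.<major>, and so on.
    return [stem for stem in stems if not any(root.glob(f"{stem}.so*"))]
```

Calling missing_cuda_libs with the path to $CONDA_PREFIX/lib returns an empty list when all three libraries are present. Any name it returns points to a library that the fix below should provide.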

Fix:

conda install cuda-libraries cuda-cudart-dev -c nvidia -y
# Copy headers for CuPy JIT
cp -r "$CONDA_PREFIX"/targets/x86_64-linux/include/* "$CONDA_PREFIX/include/" 2>/dev/null

Prevention: Use envs/env-hpc.yml, which includes cuda-libraries and cuda-cudart-dev in the conda section.

Issue 4: PyTorch is CPU-Only

Symptoms:

[ERROR]    torch-build
           CPU-only PyTorch detected! GPU processing will fail.

Root cause: PyTorch was installed from pip without the CUDA index URL, or a later pip operation overwrote the conda-installed GPU version.

Fix:

# Remove existing torch
pip uninstall torch torchvision -y

# Re-install from PyTorch CUDA channel
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128

# Or via conda (preferred on HPC):
conda install pytorch torchvision pytorch-cuda=12.8 -c pytorch -c nvidia -c conda-forge

Prevention: Use envs/env-hpc.yml, which installs PyTorch from the pytorch conda channel with pytorch-cuda=12.8. Never run a bare pip install torch on HPC; always pass the CUDA index URL.

Issue 5: SLURM Jobs Fail With TRES Error

Symptoms: Snakemake SLURM jobs fail immediately with:

error: Invalid --tres-per-task specification

Root cause: SLURM >= 24.11 changed the SLURM_TRES_PER_TASK environment variable format, which conflicts with the Snakemake jobstep executor plugin.

Fix:

kintsugi patch slurm

This patches the snakemake-executor-plugin-slurm-jobstep plugin to unset SLURM_TRES_PER_TASK. Re-apply the patch after upgrading the plugin, because an upgrade replaces the patched files.
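The idea behind the patch can be illustrated in a few lines: build a copy of the environment without the offending variable and hand that copy to the child process. This is a sketch of the general technique only, not the contents of the actual patch, and the function name is hypothetical.

```python
import os


def env_without_tres(env=None):
    """Copy an environment mapping, dropping SLURM_TRES_PER_TASK if it is present."""
    source = os.environ if env is None else env
    # The caller's mapping is left untouched; only the returned copy lacks the variable.
    return {key: value for key, value in source.items() if key != "SLURM_TRES_PER_TASK"}
```

The result can be passed as the env argument of subprocess.run when launching a job step by hand, for example subprocess.run(cmd, env=env_without_tres()).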

Prevention: ./scripts/install.sh --hpc and kintsugi install all auto-apply this patch.

Issue 6: torch-cuda Shows [SKIP] on Login Node

Symptoms:

[SKIP]     torch-cuda (optional)

This is expected. Login nodes have no GPU hardware, so torch.cuda.is_available() returns False. PyTorch and CuPy will work correctly on compute nodes.

To verify the torch build is CUDA-enabled (not CPU-only), check:

python -c "import torch; print(f'CUDA build: {torch.version.cuda}')"

If this prints CUDA build: 12.8 (or similar), you’re fine. If it prints CUDA build: None, the build is CPU-only; see Issue 4.
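The distinction between a CPU-only build (Issue 4) and a CUDA build running on a node without a GPU can be made explicit with a small classifier. This is a sketch under the assumption that torch.version.cuda is None for CPU-only builds. The function names and message strings are illustrative.

```python
from typing import Optional


def classify_torch(cuda_build: Optional[str], gpu_visible: bool) -> str:
    """Describe a PyTorch install from its CUDA build string and GPU visibility."""
    if cuda_build is None:
        return "cpu-only build: reinstall PyTorch (Issue 4)"
    if not gpu_visible:
        return f"CUDA {cuda_build} build, no GPU visible: expected on a login node"
    return f"CUDA {cuda_build} build with a visible GPU"


def describe_installed_torch() -> str:
    """Classify the PyTorch that is installed in the current environment."""
    import torch  # imported lazily so classify_torch works without PyTorch installed

    return classify_torch(torch.version.cuda, torch.cuda.is_available())
```

Running print(describe_installed_torch()) on a login node and again on a compute node shows whether the [SKIP] result is only a matter of missing hardware.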

Verification

Run the comprehensive verification script:

./scripts/verify_hpc_env.sh

Or the quick check:

kintsugi check

Getting Help

If none of the above resolves your issue:

  1. Run kintsugi check and save the full output

  2. Run conda list > env_packages.txt to capture installed packages

  3. Report the issue at https://github.com/smith6jt-cop/KINTSUGI/issues