AlphaFold 2.3.2

(Only a rough outline of the procedure is given.)

Python environment setup

Miniforge is assumed to be already installed under /apl/alphafold/miniforge3. The AlphaFold code is placed under /apl/alphafold/2.3.2.

(base) [user@ccfep4 2.3.2]$ conda install -y -c conda-forge openmm=7.5.1 cudatoolkit==11.2.2 cudnn pdbfixer pip python=3.8
(base) [user@ccfep4 2.3.2]$ conda install -y -c bioconda hmmer==3.3.2 hhsuite==3.3.0 kalign2==2.04
(base) [user@ccfep4 2.3.2]$ CONDA_OVERRIDE_CUDA=11.2 conda install jax==0.3.25 "jaxlib=0.3.25=*cuda*"
(base) [user@ccfep4 2.3.2]$ pip install absl-py==1.0.0 biopython==1.79 chex==0.0.7 dm-haiku==0.0.9 dm-tree==0.1.6 immutabledict==2.0.0 ml-collections==0.1.0 numpy==1.21.6 pandas==1.3.4 scipy==1.7.0 tensorflow-cpu==2.11.0
(base) [user@ccfep4 2.3.2]$ cd /apl/alphafold/miniforge3/lib/python3.8/site-packages/
(base) [user@ccfep4 site-packages]$ patch -p0 < ../../../../2.3.2/docker/openmm.patch

  • With CUDA 11.1.1, as specified in the official instructions, the intended Jax version could not be installed, so 11.2.2 was used instead as a workaround.
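
Because the CUDA toolkit version was substituted, it can be useful to confirm which devices JAX actually reports in this environment. The following is a minimal sketch, not part of the original procedure; the function name visible_jax_devices is hypothetical, and it assumes the jax package installed above is the one on the Python path.

```python
def visible_jax_devices():
    """Return descriptions of the devices JAX reports, or None if JAX is not importable."""
    try:
        import jax  # assumed: the jax 0.3.25 build installed in the steps above
    except ImportError:
        return None
    # Each device exposes a platform name (for example "gpu" or "cpu") and an integer id.
    return [f"{d.platform}:{d.id}" for d in jax.devices()]


if __name__ == "__main__":
    devices = visible_jax_devices()
    if devices is None:
        print("jax is not installed in this Python environment")
    else:
        print("JAX devices:", devices)
```

If only CPU devices are listed on a GPU node, the CUDA-enabled jaxlib build was probably not picked up.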

DB

The databases installed on 2023-01-30 for 2.3.1 are reused. Only pdb_mmcif and pdb_seqres were updated to the latest versions.

AlphaFold

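  • alphafold/common/stereo_chemical_props.txt is placed in the same way as in previous installations.
  • The following patch is applied to alphafold/data/tools/hhblits.py (so that the number of hhblits threads can be set through an environment variable).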

--- hhblits.py.org      2024-02-06 16:27:37.000000000 +0900
+++ hhblits.py  2024-02-07 12:19:54.000000000 +0900
@@ -94,6 +94,14 @@
    self.p = p
    self.z = z

+    n_cpu_env = os.getenv("HHBLITS_NTHREADS")
+    if n_cpu_env:
+        try:
+            n_cpu_env = int(n_cpu_env)
+            self.n_cpu = n_cpu_env
+        except ValueError:
+            pass
+
  def query(self, input_fasta_path: str) -> List[Mapping[str, Any]]:
    """Queries the database using HHblits."""
    with utils.tmpdir_manager() as query_tmp_dir:

  • The following patch is applied to alphafold/data/tools/jackhmmer.py (so that the number of jackhmmer threads can be set through an environment variable).

--- jackhmmer.py.org    2024-02-06 16:33:47.000000000 +0900
+++ jackhmmer.py        2024-02-07 12:20:23.000000000 +0900
@@ -87,6 +87,14 @@
    self.get_tblout = get_tblout
    self.streaming_callback = streaming_callback

+    n_cpu_env = os.getenv("JACKHMMER_NTHREADS")
+    if n_cpu_env:
+        try:
+            n_cpu_env = int(n_cpu_env)
+            self.n_cpu = n_cpu_env
+        except ValueError:
+            pass
+
  def _query_chunk(self,
                   input_fasta_path: str,
                   database_path: str,
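
Both patches follow the same pattern: read an environment variable, and override the thread count only if the value parses as an integer. The standalone sketch below illustrates that logic outside the AlphaFold classes. The helper name thread_count_from_env and the default of 4 used in the demonstration are illustrative assumptions, not part of the patches.

```python
import os


def thread_count_from_env(var_name: str, default: int) -> int:
    """Return the integer stored in var_name, or default if it is unset or not an integer."""
    raw = os.getenv(var_name)
    if not raw:
        # Unset or empty: keep the value passed to the constructor.
        return default
    try:
        return int(raw)
    except ValueError:
        # Non-numeric values are ignored, as in the patches.
        return default


if __name__ == "__main__":
    os.environ["HHBLITS_NTHREADS"] = "8"
    print(thread_count_from_env("HHBLITS_NTHREADS", 4))
    os.environ["HHBLITS_NTHREADS"] = "many"
    print(thread_count_from_env("HHBLITS_NTHREADS", 4))
```

Like the patches, this sketch does not range-check the value, so zero or negative integers are passed through unchanged.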

Wrapper script

#!/bin/bash
# Description: AlphaFold non-docker version
# Author: Sanjay Kumar Srikakulam
#         
#
# RCCS notes:
# This script was customized for RCCS by M. Kamiya (IMS).
# original: https://github.com/kalininalab/alphafold_non_docker

# This script is for AlphaFold 2.3.2!
# Former AlphaFold versions may not be compatible with this script!

# RCCS default value
af2root="/apl/alphafold/2.3.2"
data_dir="/apl/alphafold/databases/20240206"

max_template_date="2024-02-06"
benchmark=false
db_preset="full_dbs"
model_preset="monomer"
use_gpu=false
MYOPTS="" # variable for misc options

usage() {
       echo ""
       echo "Usage: $0 <OPTIONS>"
       echo "Required Parameters:"
       echo "-o <output_dir>   Path to a directory that will store the results."
       echo "-f <fasta_path>   Path to a FASTA file containing one sequence"
       echo ""
       echo "Optional Parameters:"
       echo "-a <alphafolddir> Path to alphafold code"
       echo "-d <data_dir>     Path to directory of supporting data"
       echo "-t <max_template_date> Maximum template release date to consider (ISO-8601 format - i.e. YYYY-MM-DD). Important if folding historical test sets (default: 2024-02-06)"
       echo "-Q                Also show pTM score etc. (alias of -m monomer_ptm)"
       echo "-b                Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inference on many proteins (default: 'False')"
       echo "-g                Run the relaxation step on GPU (passes --use_gpu_relax=True; default: CPU)"
       echo "-G <gpu_devices>  Comma separated list of devices to pass to 'CUDA_VISIBLE_DEVICES' (default: '')"
       echo "-r <relax_tgt>    Choose relax target from all models ('all'), the most confident model ('best'), or skip relaxation ('none')"
       echo "-R                Skip running MSA tools and use precomputed ones. NOTE: this will not check if sequence/db/conf have changed."
       echo "-s <seeds per model>  Number of seeds per model for multimer system. (Number of models (usually 5)) * (number of seeds; this param) predictions will be performed. (default: 5)"
       echo "-p <db_preset>    Choose db preset - full genetic databases (full_dbs) or reduced version of dbs (reduced_dbs) (default: 'full_dbs')"
       echo "-m <model_preset> Choose model preset - monomer model (monomer), monomer with extra ensembling (monomer_casp14), monomer model with pTM head (monomer_ptm), or multimer model (multimer) (default: 'monomer')"
       echo ""
       exit 1
}

# -a sets the AlphaFold root; the GPU device list uses -G so the two options do not collide.
while getopts ":a:d:o:f:t:G:p:s:m:r:bgQR" i; do
       case "${i}" in
       a)
               echo "INFO: set AF2 root to $OPTARG"
               af2root=$OPTARG
       ;;
       d)
               echo "INFO: set database root to $OPTARG"
               data_dir=$OPTARG
       ;;
       o)
               output_dir=$OPTARG
       ;;
       f)
               fasta_path=$OPTARG
       ;;
       t)
               max_template_date=$OPTARG
       ;;
       b)
               benchmark=true
       ;;
       g)
               use_gpu=true
       ;;
       Q)
               echo "INFO: set model_preset=monomer_ptm"
               model_preset="monomer_ptm"
       ;;
       G)
               gpu_devices=$OPTARG
       ;;
       p)
               db_preset=$OPTARG
       ;;
       m)
               model_preset=$OPTARG
       ;;
       r)
               MYOPTS="$MYOPTS --models_to_relax=$OPTARG"
       ;;
       s)
               MYOPTS="$MYOPTS --num_multimer_predictions_per_model=$OPTARG"
       ;;
       R)
               MYOPTS="$MYOPTS --use_precomputed_msas=True"
       ;;
       :)
               echo "ERROR: option -$OPTARG requires an argument"
               usage
       ;;
       \?)
               echo "ERROR: unknown option -$OPTARG"
               usage
       ;;
       esac
done

# Parse input and set defaults
if [[ "$data_dir" == "" || "$output_dir" == "" || "$fasta_path" == "" ]] ; then
   usage
fi

if [[ "$db_preset" != "full_dbs" && "$db_preset" != "reduced_dbs" ]] ; then
   echo "Unknown db_preset! Using default ('full_dbs')"
   db_preset="full_dbs"
fi

if [[ "$model_preset" != "monomer" && "$model_preset" != "monomer_casp14" && "$model_preset" != "monomer_ptm" && "$model_preset" != "multimer" ]]; then
   echo "Unknown model_preset! Using default ('monomer')"
   model_preset="monomer"
fi

alphafold_script="$af2root/run_alphafold.py"
if [ ! -f "$alphafold_script" ]; then
   echo "AlphaFold Python script $alphafold_script does not exist."
   exit 1
fi

if "$use_gpu" ; then
 MYOPTS="$MYOPTS --use_gpu_relax=True"
else
 MYOPTS="$MYOPTS --use_gpu_relax=False"
fi

if [[ "$gpu_devices" ]] ; then
   export CUDA_VISIBLE_DEVICES=$gpu_devices
fi

export TF_FORCE_UNIFIED_MEMORY='1'
export XLA_PYTHON_CLIENT_MEM_FRACTION='4.0'

# Binary path (change me if required)
hhblits_binary_path=$(which hhblits)
hhsearch_binary_path=$(which hhsearch)
jackhmmer_binary_path=$(which jackhmmer)
kalign_binary_path=$(which kalign)

MYOPTS="$MYOPTS --hhblits_binary_path=$hhblits_binary_path"
MYOPTS="$MYOPTS --hhsearch_binary_path=$hhsearch_binary_path"
MYOPTS="$MYOPTS --jackhmmer_binary_path=$jackhmmer_binary_path"
MYOPTS="$MYOPTS --kalign_binary_path=$kalign_binary_path"

# uniref30 path
# Prefer a versioned 'UniRef*' entry directly under data_dir (the first match is used)
uniref_new=$(find "$data_dir" -maxdepth 1 -name 'UniRef*' | head -n 1)
if [ -n "$uniref_new" ]; then
 uniref_name=$(basename "$uniref_new")
 uniref30_database_path="$data_dir/$uniref_name/$uniref_name"
elif [ -d "$data_dir/uniref30" ]; then
 uniref30_database_path="$data_dir/uniref30/UniRef30_2021_03"
fi

# bfd path
if [[ "$db_preset" == "reduced_dbs" ]] ; then
 small_bfd_database_path="$data_dir/small_bfd/bfd-first_non_consensus_sequences.fasta"
 MYOPTS="$MYOPTS --small_bfd_database_path=$small_bfd_database_path"
 # uniref30 not necessary
else
 bfd_database_path="$data_dir/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"
 MYOPTS="$MYOPTS --bfd_database_path=$bfd_database_path"
 # uniref30 required
 MYOPTS="$MYOPTS --uniref30_database_path=$uniref30_database_path"
fi

# Path and user config (change me if required)
if [ -f "$data_dir/mgnify/mgy_clusters_2022_05.fa" ]; then
 mgnify_database_path="$data_dir/mgnify/mgy_clusters_2022_05.fa"
else
 mgnify_database_path="$data_dir/mgnify/mgy_clusters.fa"
fi
template_mmcif_dir="$data_dir/pdb_mmcif/mmcif_files"
obsolete_pdbs_path="$data_dir/pdb_mmcif/obsolete.dat"
uniref90_database_path="$data_dir/uniref90/uniref90.fasta"

MYOPTS="$MYOPTS --mgnify_database_path=$mgnify_database_path"
MYOPTS="$MYOPTS --template_mmcif_dir=$template_mmcif_dir"
MYOPTS="$MYOPTS --obsolete_pdbs_path=$obsolete_pdbs_path"
MYOPTS="$MYOPTS --uniref90_database_path=$uniref90_database_path"

# for multimer (pdb70 must not be specified in this case)
if [[ "$model_preset" == "multimer" ]]; then
 echo "INFO: appending database paths for multimer model..."
 uniprot_database_path="$data_dir/uniprot/uniprot.fasta"
 MYOPTS="$MYOPTS --uniprot_database_path=$uniprot_database_path"
 pdb_seqres_database_path="$data_dir/pdb_seqres/pdb_seqres.txt"
 MYOPTS="$MYOPTS --pdb_seqres_database_path=$pdb_seqres_database_path"
else
 pdb70_database_path="$data_dir/pdb70/pdb70"
 MYOPTS="$MYOPTS --pdb70_database_path=$pdb70_database_path"
fi

#echo $MYOPTS

# Run AlphaFold with required parameters
# ($MYOPTS is intentionally left unquoted so that it splits into separate options)
python "$alphafold_script" --data_dir="$data_dir" --output_dir="$output_dir" --fasta_paths="$fasta_path" --max_template_date="$max_template_date" --db_preset="$db_preset" --model_preset="$model_preset" --benchmark="$benchmark" --logtostderr $MYOPTS
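
The wrapper builds its database paths from fixed subdirectory names under data_dir, and which ones it needs depends on db_preset and model_preset. A quick pre-flight check before submitting a long job could look like the sketch below. The function names required_db_dirs and missing_db_dirs are hypothetical helpers, not part of the wrapper, and only the presence of the top-level subdirectories is checked, not the files inside them.

```python
import tempfile
from pathlib import Path


def required_db_dirs(db_preset="full_dbs", model_preset="monomer"):
    """Subdirectories of data_dir that the wrapper's path logic refers to."""
    dirs = ["uniref90", "mgnify", "pdb_mmcif"]
    # reduced_dbs uses small_bfd; full_dbs uses bfd together with uniref30.
    dirs += ["small_bfd"] if db_preset == "reduced_dbs" else ["bfd", "uniref30"]
    # The multimer preset uses uniprot and pdb_seqres instead of pdb70.
    dirs += ["uniprot", "pdb_seqres"] if model_preset == "multimer" else ["pdb70"]
    return dirs


def missing_db_dirs(data_dir, db_preset="full_dbs", model_preset="monomer"):
    """Return the required subdirectories that are absent under data_dir."""
    root = Path(data_dir)
    missing = []
    for name in required_db_dirs(db_preset, model_preset):
        if name == "uniref30":
            # The wrapper accepts either a 'UniRef*' entry or a 'uniref30' directory.
            found = any(root.glob("UniRef*")) or (root / "uniref30").is_dir()
        else:
            found = (root / name).is_dir()
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Demonstration on a throwaway directory rather than the real database tree.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("uniref90", "mgnify", "pdb_mmcif", "bfd", "pdb70"):
            (Path(tmp) / name).mkdir()
        print(missing_db_dirs(tmp))  # uniref30 is the only one not created
```

On the real system the same function would be called with the data_dir passed to the wrapper, for example the default /apl/alphafold/databases/20240206.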
 

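Notes

  • Please note that the CUDA version differs from the one specified officially.
    • As far as we have checked, it appears to work without problems.
  • The number of threads used by hhblits and jackhmmer can be changed with the HHBLITS_NTHREADS and JACKHMMER_NTHREADS environment variables.
    • Slightly larger values may improve speed in some cases.
      • (Any speedup may depend on the performance of the Lustre file system used at this center.)
    • Smaller values are likely to reduce speed. Using a very large number of threads does not improve speed either.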
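
As an illustration of how the two environment variables combine with the wrapper, the sketch below assembles a command line and an environment for a single run. The wrapper file name ./run_alphafold.sh, the input and output names, and the thread counts are placeholders chosen for the example; the helper build_wrapper_call is hypothetical.

```python
import os


def build_wrapper_call(wrapper, fasta_path, output_dir,
                       hhblits_threads=None, jackhmmer_threads=None):
    """Return (argv, env) for running the wrapper with optional thread overrides."""
    argv = ["bash", wrapper, "-f", fasta_path, "-o", output_dir]
    env = dict(os.environ)  # copy, so the current process environment is not modified
    if hhblits_threads is not None:
        env["HHBLITS_NTHREADS"] = str(hhblits_threads)
    if jackhmmer_threads is not None:
        env["JACKHMMER_NTHREADS"] = str(jackhmmer_threads)
    return argv, env


if __name__ == "__main__":
    # Placeholder names; replace them with the actual wrapper path and input files.
    argv, env = build_wrapper_call("./run_alphafold.sh", "query.fasta", "out",
                                   hhblits_threads=8, jackhmmer_threads=12)
    print(argv)
    print(env["HHBLITS_NTHREADS"], env["JACKHMMER_NTHREADS"])
    # To actually launch the job on a host where the wrapper exists:
    #   import subprocess
    #   subprocess.run(argv, env=env, check=True)
```

Timing a few short runs with different values is probably the most reliable way to choose thread counts for a given file system.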