(Last update: Sep 19, 2023) Information about the new system launched in February 2023.
Common Questions
- For formchk of Gaussian, please check this FAQ item.
- For "Cgroup mem limit exceeded" message, please check this FAQ item.
- For "No space left on device" error, please check this FAQ item.
- Please note that the amount of memory per CPU core is about half that of the previous system. You need to use twice as many cores as before to get a comparable amount of memory.
- There are additional limits for per-core jobs (jobtype=core with ncpus < 64, or gpu jobs) and for jobtype=largemem jobs. The limit values can be shown with the "jobinfo -s" command.
- If you run too many jobs in a short period of time, a penalty might be imposed.
- If you plan to run thousands of jobs in a day, please merge them into fewer, larger jobs.
- In case "setvars.sh" is loaded in ~/.bashrc, sftp (including WinSCP) failed to connect to ccfep due to the output from that script.
- You may be able to avoid this error by discarding output from setvars.sh like "source ~/intel/oneapi/setvars.sh >& /dev/null"
- (Load setvars.sh only if $PS1 is not null should work. Just move that line to ~/.bash_profile may also work.)
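- For example, a minimal sketch of a guarded entry in ~/.bashrc (assuming oneAPI is installed under ~/intel):
  # load oneAPI only in interactive shells and discard its output,
  # so that sftp/WinSCP sessions are not disturbed
  if [ -n "$PS1" ]; then
      source ~/intel/oneapi/setvars.sh > /dev/null 2>&1
  fi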
- Jobs failed or froze with the "Failed to shmget with IPC_PRIVATE, size 20971520, IPC_CREAT; errno 28:No space left on device" message when HPC-X was employed.
- A message like "sys.c:883 UCX ERROR shmget(size=4296704 flags=0x7b0) for mm_recv_desc failed: No space left on device, please check shared memory limits by 'ipcs -l'" is shown together with it.
- This can happen in cp2k jobs when many processes are employed.
- This seems to be caused by the hcoll library employed in HPC-X: hcoll runs out of shared memory segments (the current maximum is 4096).
- This can be avoided by adding the "-mca coll_hcoll_enable 0" option to mpirun, as in the example below.
- It is possible to increase the maximum number of shared memory segments. If you really need that, please ask us.
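- Example of an mpirun invocation with hcoll disabled (the process count and program name are placeholders):
  mpirun -mca coll_hcoll_enable 0 -np 128 ./your_program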
Known Issues (Last Update: Sep 19, 2023)
- (Completed items are moved to the bottom of this page.) (Mar 6)
- Libhcoll has a problem with Intel MPI. (Apr 26)
- Confirmed with Intel MPI 2021.8. Other versions may also be affected.
- When the problem occurred, the program simply froze without any log messages.
- In this case, the program appeared to be frozen in libhcoll code called from the MPI functions of ScaLAPACK in Intel MKL.
- This problem can be avoided by setting "export I_MPI_COLL_EXTERNAL=no" (bash) before running mpirun, as in the example below.
- According to the official manual, the default value of this parameter is "no". However, explicitly setting this environment variable to "no" clearly changes the behavior of the program.
- In debug mode (using I_MPI_DEBUG), a message like '[0] MPI startup(): detected mlx provider, set device name to "mlx_hcoll"' is shown with the default settings. This message disappears once "export I_MPI_COLL_EXTERNAL=no" is set.
- Intel MPI probably uses hcoll when it is available.
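- For example (bash; the process count and program name are placeholders):
  export I_MPI_COLL_EXTERNAL=no
  mpirun -np 64 ./your_program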
- The hcoll library included in HPC-X outputs warning (error?) messages starting with [LOG_CAT_P2P] and/or [LOG_CAT_MCAST] in some cases. (Mar 1)
- You can suppress these messages by adding "-mca coll_hcoll_enable 0" to mpirun (this disables hcoll).
- However, when these messages are shown, there can be a real problem in the communication between processes (e.g. GAMESS built with HPC-X 2.11). Please do not simply silence the messages without investigation.
- (In the GAMESS case, just suppressing the messages did not improve performance; it only made the problem harder to detect.)
- If the problem concerns only MCAST, the messages can be removed by passing "-x HCOLL_ENABLE_MCAST=0" to mpirun, as in the example below. In that case the problem is likely not serious.
- This can be caused by the "knem" module (enabled on both computation and frontend nodes). This has not yet been investigated at this center.
- /apl/openmpi/3.1.6 is free from hcoll, so these messages won't be shown.
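- Example of suppressing only the MCAST-related messages (the process count and program name are placeholders):
  mpirun -x HCOLL_ENABLE_MCAST=0 -np 128 ./your_program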
- "Disk option" of installed Molpro 2022.3.0, 2022.2.2, 2021.3.1 does not work properly. (May 17)
- (See this official page for details about disk option.)
- molpro 2022.3.0 could have a problem related to hcoll. We are now investigating this issue. (Sep 12)
- According to the description in the official documentation, openmpi may not work with
-–ga-impl disk
when multiple parallel Molpro calculations are executed simultaneously on a node. (Sep 12)
- 2022.3.0-mva (installing now) may be free from this issue. (May 18 installed)
- Open MPI 4.1.5/4.1.4 without hcoll seems to be OK. However, MCSCF (PNO-LCCSD) jobs sometimes abort with the "ERROR: Error setting an MPI file to atomic mode" message. We have no idea about this issue at the moment. This happens even when a node is used exclusively by a single job. (Sep 19)
- The error may be related to MPI-IO.
- Intel MPI 2021.7.1, HPC-X 2.11 (Open MPI 4.1.4), and HPC-X 2.13.1 (Open MPI 4.1.5) are affected by this issue.
- Intel MPI 2021.5.1, 2021.8.0, and 2021.9.0 are not affected. The problem happens only when the runtime library of 2021.7.1 is used.
- Open MPI 3.1.6 is free from this issue. However, this version shows a strange error message unrelated to it.
- HPC-X 2.11 and 2.13.1 (Open MPI 4.x) sometimes hang due to hcoll. However, the program sometimes hangs even when hcoll is disabled.
- On the other hand, Intel MPI, which also uses hcoll (a different version, though), does not hang. Hcoll may not be the main cause of this issue.
- HPC-X 2.11: hcoll v4.7.3208 (1512342c)
- HPC-X 2.13.1: hcoll v4.8.3220 (2e96bc4e)
- Intel MPI: hcoll v4.8.3221 (b622bfb) (/opt/mellanox/hcoll)
- MVAPICH 2.3.7 works fine without problems.
- Molpro versions before 2021.1 are not affected by this issue (the disk option is not implemented there). In 2021.1 you are affected only if you explicitly enable the disk option. (The disk option has been the default since 2021.2.)
- You can disable the "disk option" by adding the "--ga-impl ga" option to molpro, as in the example below. This is already added in the RCCS sample job scripts.
How to load my oneAPI environment (Mar 3 update)
The oneAPI Base Toolkit can be downloaded from this page. Compilers, MKL, and MPI are included. Please use the offline or online installer for Linux. If you need the Fortran compilers (ifort, ifx), you also need to install the oneAPI HPC Toolkit; please download it from this page. (Mar 3)
For bash users, the module-based method below works, but simply loading ~/intel/oneapi/setvars.sh is easier.
You can use individual components of oneAPI, such as the compilers and MKL, by loading the "vars.sh" file in each component directory.
(e.g. source ~/intel/oneapi/compiler/latest/env/vars.sh)
Here we introduce a simple way using the module command. There may be several other ways to do it.
It is assumed that oneAPI is already installed under ~/intel.
$ cd ~/intel/oneapi
$ sh modulefiles-setup.sh
$ cd modulefiles/
$ module use .
$ module save
This gathers the oneAPI module files into the modulefiles/ directory and registers that directory in the module search path.
The final "module save" command saves the settings. The saved environment will be restored automatically upon login to the RCCS system.
If you want to use Intel compilers, you need to run "module load compiler/latest".
(Please check "module avail" for details about packages and their versions.)
You can also load compilers and other packages before "module save".
In this case, the compilers and libraries are available immediately after your next login.
$ cd modulefiles/
$ module use .
$ module load compiler/latest
$ module load mkl/latest
$ module load mpi/latest
$ module save
If you want to remove the saved module environment, please try "module saverm". Please note that your saved environment is independent of the system default settings; changes in the system defaults do not affect your environment.
Hardware
| node type | CPU, GPU | memory [GiB] | # of nodes | local scratch on NVMe SSD [GB] | note |
|---|---|---|---|---|---|
| TypeC | AMD EPYC 7763 (64 cores, 2.45 GHz) [2 CPUs/node] | 256 | 804 | 1536 | for vnode/core jobs |
| TypeF | AMD EPYC 7763 (64 cores, 2.45 GHz) [2 CPUs/node] | 1024 | 14 | 1536 | for large memory vnode jobs |
| TypeG | AMD EPYC 7763 (64 cores, 2.45 GHz) [2 CPUs/node] + NVIDIA A100 NVLink 80 GB [8 GPUs/node] | 256 | 16 | 1536 | for GPU jobs |
- Node jobs will be assigned to vnode(s). A vnode has 64 cores.
- In addition to global scratch space (corresponding to /work in the previous system), local scratch space on NVMe SSD is available for jobs.
Storage
14.8 PB (Lustre)
Interconnect
InfiniBand
Queue Classes and Queue Factor for FY2023
Node jobs will be assigned vnodes (64 cores/vnode). A computation node consists of two vnodes.
| queue name (jobtype) | # of total vnodes (# of total cores) | limit for a job | memory [GiB/core] | queue factor | assign unit | node type |
|---|---|---|---|---|---|---|
| H (largemem) | 28 vnodes (1,792 cores) | 1-14 vnodes (64-896 cores) | 7.875 | 60 points / (1 vnode * 1h) | vnode | TypeF |
| H (vnode) | 1,248+ vnodes (79,872+ cores) | 1-50 vnodes (64-3,200 cores) | 1.875 | 45 points / (1 vnode * 1h) | vnode | TypeC |
| H (core) | 200+ vnodes (12,800+ cores) | 1-63 cores | 1.875 | 1 point / (1 core * 1h) | core | TypeC |
| H (gpu) | 32 vnodes (2,048 cores, 128 GPUs) | 1-48 GPUs, 1-16 cores/GPU | 1.875 | 60 points / (1 GPU * 1h) + 1 point / (1 core * 1h) | core | TypeG |
| HR* [private] | - | - | 1.875 | 45 points / (1 vnode * 1h) | vnode | TypeC |

Group limits for queue H are shared across the jobtypes and depend on the assigned points of the group; the limit on the number of jobs per group is 1,000.

| assigned points | cores/GPUs limit |
|---|---|
| 7,200,000+ | 9,600 / 64 |
| 2,400,000+ | 6,400 / 42 |
| 720,000+ | 4,096 / 28 |
| 240,000+ | 3,200 / 12 |
| under 240,000 | 768 / 8 |
- Jobs must be finished before the scheduled maintenance.
- Only around half of the available nodes may accept long jobs (longer than a week).
- You can omit jobtype in your job script except for jobtype=largemem; the other types can be inferred from the requested resources.
- 80 nodes of TypeC (160 vnodes) will be shared by "vnode" and "core" jobs.
- Short "vnode" jobs can be assigned to a "largemem" node.
- Short "core" jobs can be assigned to a "gpu" node.
- (Queue factors during FY2022 were different from these values.)
How to login and how to send/recv files
Once your user account is assigned, you can log in to the login node (ccfep.ims.ac.jp).
A login guide is available on this page. The procedure itself is almost the same as for the previous system.
How to submit jobs (jsub)
You need to prepare a job script to submit a job.
Creating a job script from scratch is not an easy task; we recommend using a sample job script as a template.
The links in the "Job Submission Guides" section of the quick start guide page may be helpful in this regard.
For Gaussian, please check the g16sub guide below. Some resource definition examples are shown below.
example 1: use 5 nodes (640 (128*5) cores, 320 (64*5) MPI)
#PBS -l select=5:ncpus=128:mpiprocs=64:ompthreads=2
example 1.5: use 10 vnodes (64 for each) (640 (64*10) cores, 320(32*10) MPI)
#PBS -l select=10:ncpus=64:mpiprocs=32:ompthreads=2
example 2: 16-core (16 MPI)
#PBS -l select=1:ncpus=16:mpiprocs=16:ompthreads=1
example 3: 16-core (16 OpenMP) + 1 GPU
#PBS -l select=1:ncpus=16:mpiprocs=1:ompthreads=16:ngpus=1
note: there are 8 GPUs in a node. The number of CPU cores per GPU (ncpus/ngpus) must be <= 16.
example 4: 64 cores (1 vnode), large memory node (~500 GB of memory / vnode)
#PBS -l select=1:ncpus=64:mpiprocs=32:ompthreads=2:jobtype=largemem
The job type specification, jobtype=largemem, is necessary in this case.
Changes from the previous system
- You can omit queue specification (-q H).
- You can also omit jobtype from the resource definition except for large memory jobs (jobtype=largemem).
- You can use the local disk on computation nodes as scratch space (it cannot be accessed directly from other nodes). The path is /lwork/users/${USER}/${PBS_JOBID}. This directory is removed at the end of the job. (See the example after this list.)
- The capacity is 11.9 GB * ncpus, where ncpus is the number of CPU cores available to you on that node (not the total CPU core count of your job).
- For example, in the case of select=2:ncpus=64:mpiprocs=64..., 64 * 11.9 GB is available on each node.
- You can also use the huge scratch space /gwork/users/${USER}. This corresponds to /work/users/${USER} of the previous system and is shared among nodes.
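Example job script fragment using the local scratch area (a sketch; the program and file names are placeholders):
#!/bin/sh
#PBS -l select=1:ncpus=16:mpiprocs=16:ompthreads=1
# /lwork is a node-local NVMe area; it is removed automatically when the job ends
cd /lwork/users/${USER}/${PBS_JOBID}
cp ${PBS_O_WORKDIR}/input.dat .
./your_program input.dat > ${PBS_O_WORKDIR}/output.log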
How to check job status (jobinfo)
The status of submitted jobs can be checked with the jobinfo command on the login node (ccfep).
Basic usage and sample output are shown below. You don't need to specify the queue name (-q H).
In other respects, the new command is almost identical to the one on the previous system.
jobinfo -c
You can see the latest status of your jobs. However, some details, such as the number of GPUs, are not available.
Combinations with other options are also restricted.
[user@ccfep ~]$ jobinfo -c
--------------------------------------------------------------------------------
Queue Job ID Name Status CPUs User/Grp Elaps Node/(Reason)
--------------------------------------------------------------------------------
H 4008946 sample3.csh Run 16 uuu/--- 1:51:14 ccc001
H 4008952 sample0.sh Run 16 uuu/--- 1:51:08 ccc001
H 4010452 sample-gpu.sh Queue 1 uuu/--- 0:02:40 (gpu)
--------------------------------------------------------------------------------
Job Status at 2023-01-28 14:00:12
jobinfo
You can get detailed information about your jobs. There can be a delay of up to several minutes, though.
(You may want to use this combined with -m or other options.)
[user@ccfep ~]$ jobinfo
--------------------------------------------------------------------------------
Queue Job ID Name Status CPUs User/Grp Elaps Node/(Reason)
--------------------------------------------------------------------------------
H(c) 4008946 sample3.csh Run 16 uuu/ggg 1:51:14 ccc001
H(c) 4008952 sample0.sh Run 16 uuu/ggg 1:51:08 ccc001
H(g) 4010452 sample-gpu.sh Queue 1+1 uuu/ggg 0:02:40 (gpu)
--------------------------------------------------------------------------------
Job Status at 2023-01-28 14:00:12
For the latest status, please use '-c' option without '-m', '-w', '-s'.
jobinfo -s
A summary of the CPU and GPU usage of you and your group will be shown. Queue summary information is also shown.
[user@ccfep ~]$ jobinfo -s
User/Group Stat:
--------------------------------------------------------------------------------
queue: H | user(qf7) | group(za0)
--------------------------------------------------------------------------------
NJob (Run/Queue/Hold/RunLim) | 0/ 0/ 0/- | 0/ 0/ 0/2560
CPUs (Run/Queue/Hold/RunLim) | 0/ 0/ 0/- | 0/ 0/ 0/2560
GPUs (Run/Queue/Hold/RunLim) | 0/ 0/ 0/- | 0/ 0/ 0/48
core (Run/Queue/Hold/RunLim) | 0/ 0/ 0/600 | 0/ 0/ 0/600
--------------------------------------------------------------------------------
note: "core" limit is for per-core assignment jobs (jobtype=core/gpu*)
Queue Status (H):
----------------------------------------------------------------------
job | free | free | # jobs | requested
type | nodes | cores (gpus) | waiting | cores (gpus)
----------------------------------------------------------------------
week jobs
----------------------------------------------------------------------
1-4 vnodes | 706 | 90368 | 0 | 0
5+ vnodes | 506 | 64768 | 0 | 0
largemem | 14 | 1792 | 0 | 0
core | 180 | 23040 | 0 | 0
gpu | 16 | 2048 (128) | 0 | 0 (0)
----------------------------------------------------------------------
long jobs
----------------------------------------------------------------------
1-4 vnodes | 326 | 41728 | 0 | 0
5+ vnodes | 226 | 28928 | 0 | 0
largemem | 7 | 896 | 0 | 0
core | 50 | 6400 | 0 | 0
gpu | 8 | 1024 (64) | 0 | 0 (0)
----------------------------------------------------------------------
How to submit Gaussian jobs (g16sub/g09sub)
Usually, jobs are submitted with the jsub command explained above. However, a special command, g16sub, is available for Gaussian 16.
g16sub generates a job script from your Gaussian input and then submits the generated job.
By default, 8 cores are used for the calculation, with a 72-hour time limit.
(g09sub command for Gaussian09 is also available. The usage is almost identical to g16sub.)
basic usage (8 cores, 72 hours)
[user@ccfep somewhere]$ g16sub input.gjf
more cores, longer walltime (16 cores, 168 hours)
[user@ccfep somewhere]$ g16sub -np 16 --walltime 168:00:00 input.gjf
note
- If you want to use a large memory node (jobtype=largemem), you need to add "-j largemem" (see the example below).
- Jobtype=vnode will be used if -np 64 or -np 128 is specified.
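large memory node example (a sketch: 64 cores with "-j largemem"; adjust the -np and --walltime values to your needs)
[user@ccfep somewhere]$ g16sub -j largemem -np 64 --walltime 72:00:00 input.gjf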
Compiler Environment
gcc, AOCC, and the NVIDIA HPC SDK are already installed.
For Intel oneAPI, only the libraries are installed. The compilers (icc, ifort, icpc, etc.) are not installed.
If you need the Intel compilers, please install the oneAPI Base Toolkit and/or the oneAPI HPC Toolkit yourself in your own directory.
For gcc, the system default one (8.5) and the gcc-toolset ones (versions 9.2, 10.3, and 11.2) are installed.
You can use the gcc-toolset gccs via the module command. (e.g. module load gcc-toolset/11)
Software related issues
A list of installed software is available on the Package Program List page.
You can see a similar list with the "module avail" command. (To quit module avail, press the "q" key or scroll to the bottom of the page.)
Other minor notices are listed below.
- Many software packages no longer provide csh config scripts. We recommend that csh users use the module command.
module command (Environment Modules)
- In job scripts, csh users need to run "source /etc/profile.d/modules.csh" before using the module command.
- In a /bin/csh job script, "source /etc/profile.d/modules.csh" is necessary if your login shell is /bin/bash.
- In a /bin/sh or /bin/bash job script, ". /etc/profile.d/modules.sh" is necessary if your login shell is /bin/csh.
- You should add -s to the module command in scripts (e.g. module -s load openmpi/3.1.6); see the example below.
- You can save the current module status with the "module save" command. The saved status will be restored automatically upon login.
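- Example header of a /bin/sh job script that loads a module (a sketch; the resource request line is just an example):
  #!/bin/sh
  #PBS -l select=1:ncpus=16:mpiprocs=16:ompthreads=1
  # required if your login shell is /bin/csh
  . /etc/profile.d/modules.sh
  module -s load openmpi/3.1.6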
Completed Issues
- waitest is installed in /usr/local/bin (Sep 4)
"remsh" command to check local scratch (/lwork) and ramdisk (/ramd/users/${USER}/${PBS_JOBID}) of computation nodes is not yet available.(done)
joblog is in preparation.
jobinfo -s, jobinfo -n, jobinfo sometimes show error message regarding json. (Feb 15) This should have been fixed. (Feb 16)
- jobinfo -c won't be affected by this issue.
In case the error is shown, please run the same command again. The correct result will be shown. We will fix this issue soon. (Feb 15) Done. (Feb 16)
- The new g16sub/g09sub uses the local disk /lwork as the scratch space by default. If the "-N" option is added, the shared disk /gwork (large but slow) is used as the scratch space instead of the local disk /lwork (small but fast). (Feb 1)
- AlphaFold 2.3.1 is available. (Feb 6)
- remsh is installed (/usr/local/bin/remsh). (Feb 6)
- joblog is installed (/usr/local/bin/joblog). (Feb 6)
- CPU points of jobs in this February are also shown, but the points are not consumed.
- NBO 7.0.10 installed. Gaussian 16 C.01, C.02 use it for NBO7. (Feb 14)
- Python 3.10 environment (miniforge) was prepared in /apl/conda/20230214. Please source /apl/conda/20230214/conda_init.sh or /apl/conda/20230214/conda_init.csh to load the environment. (Feb 14)
- Additional limit will be added for jobtype=largemem during the maintenance on Mar 6 (Mon). (Feb 24)
- The limit will be 896 cores (7 nodes) per group.
- The CPU core allocation rule for jobtype=core with ompthreads > 1 was modified. (Feb 20)
- In the case of Open MPI, "mpirun --map-by slot:pe=${OMP_NUM_THREADS}" or "mpirun --map-by numa:pe=${OMP_NUM_THREADS}" would work fine. An MPI process and its OpenMP threads will be assigned to a single NUMA node (which consists of 16 cores); see the sketch after this item.
- (The mapping setting above should also be applicable to jobtype=vnode (ncpus=64 or ncpus=128) runs. There can be some differences in performance between slot and numa assignment.)
- In the case of ncpus=20:mpiprocs=5:ompthreads=4, 4 cores from each of five NUMA nodes are available to your job.
- (The rule heavily depends on the number of mpiprocs. Please ask us if you have problems with this.)
- For gpu jobs, --map-by slot:pe=${OMP_NUM_THREADS} would work if ompthreads > 1 && ncpus/ngpus <= 8 && ngpus <= 2. (Feb 22)
- If you have trouble with parallel performance (especially about OpenMP), please ask us. (Feb 22)
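- A minimal sketch of a hybrid MPI/OpenMP run following the rule above (the program name is a placeholder; OMP_NUM_THREADS is expected to be set from the ompthreads value):
  #PBS -l select=1:ncpus=20:mpiprocs=5:ompthreads=4
  mpirun -np 5 --map-by numa:pe=${OMP_NUM_THREADS} ./your_program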
- The Intel MPI problem was solved by modifying parameters of the queuing system. (Feb 3)
- export I_MPI_HYDRA_BOOTSTRAP=pdsh is still a necessary parameter for multi-node jobs. It has therefore been added to the default settings; you don't need to set it manually. (Feb 3)
- GRRM17 MPI runs do not work. We are now investigating this issue. Normal runs of GRRM17 (without MPI) work fine.
- At this time, it may be safer to avoid Intel MPI if possible. Intel MPI failed to launch processes in some cases.
- When you use an Intel MPI installed in your own directory, setting the "I_MPI_HYDRA_BOOTSTRAP" environment variable to "ssh" might solve the problem.
- export I_MPI_HYDRA_BOOTSTRAP=pdsh (bash case) seems to be better. However, we have already identified cases where this does not fix the problem. (Feb 1)
- HPC-X 2.13.1 has problems in many applications. We are planning to review the software built with HPC-X 2.13.1. (Feb 7)
- Survey completed. (Feb 13)
- Problems related to HPC-X 2.13.1 were not found for cp2k-9.1, genesis (CPU), lammps (CPU), molpro, namd (CPU), nwchem, openmolcas, or siesta. (Feb 25)
- (If you know of issues with those applications, please tell us.)
- (The namd GPU version does not use MPI... sorry.)
- molpro may have a problem with HPC-X 2.13.1 (Sep 15)
- namd-2.14 (CPU version) failed when a large number of MPI processes were employed. solved. (Feb 22)
- Switching runtime library of OpenMPI to HPC-X 2.11 (OpenMPI-4.1.4) seems to fix the problem.
- Sample and module files are modified. Please use these fixed ones.
- genesis (CPU version) failed when a large number of MPI processes were employed. solved. (Feb 25)
- Switching runtime library of OpenMPI to HPC-X 2.11 (OpenMPI-4.1.4) seems to fix the problem.
- Sample and module files are modified. Please use these fixed ones.
- When performing multi-node parallel computations with Gromacs (2021.4, 2021.6, 2022.4) built with gcc and HPC-X 2.13.1 (OpenMPI-4.1.5), there was a case where the process did not terminate although the computation seemed to be finished. solved (Jan ??)
- In this case, the problem was solved by using HPC-X 2.11 instead of 2.13.1.
- A problem was identified for AMBER (pmemd GPU) built with gcc + HPC-X 2.13.1. (Feb 1) solved (Feb 1)
- Just changing runtime library to HPC-X 2.11 seems to fix the problem. (Feb 1)
- module file of AMBER was modified. Now amber uses HPC-X 2.11 for runtime library. (Feb 1)
- QE-6.8 built with the Intel compiler + HPC-X 2.13.1 almost always failed in multi-node jobs. (Feb 7) solved (Feb 7)
- Changing runtime library from HPC-X 2.13.1 to HPC-X 2.11 fixed the problem. (Feb 7)
- lammps 2022-Jun23 (GPU version) crashed with an MPI error. (Feb 8) solved (Feb 10)
- changing runtime library from HPC-X 2.13.1 to HPC-X 2.11 solved the problem. (Feb 10)
- In addition, we have prepared a wrapper script for efficient handling of multiple GPUs in a node. (Feb 10)
- genesis 2.0.3-cuda crashed with an MPI error when multiple nodes were involved. (Feb 10) solved (Feb 10)
- changing runtime library from HPC-X 2.13.1 to HPC-X 2.11 solved the problem. (Feb 10)
- /usr/bin/python3 (currently 3.9) will be replaced with python 3.6.8 in the maintenance on Mar 6 (Mon). (Feb 7)
- This is to fix the various problems related to python.
- Applications depending on python 3.9 (OpenMolcas and NWchem-7.0.2) will also be replaced on Mar 6.
- /usr/bin/perl (currently version 5.32) will be replaced with perl version 5.26 in the maintenance on Mar 6 (Mon). (Mar 3)
- Currently it is not possible to send files via scp to other compute nodes during a job. We are working on it. (Feb 2) Fixed (Feb 6)
- There is a problem with GAMESS parallel runs. We have been investigating this issue. (Feb 9)
- Please use /apl/gamess/2022R2-openmpi or /apl/gamess/2021R1-openmpi for GAMESS parallel runs for the time being. (Feb 10)
- They will become /apl/gamess/2022R2 and /apl/gamess/2021R1 in the maintenance on Mar 6 (the old ones will be removed). (Feb 10)
- Note: inter-node communications are not optimized in these openmpi versions (please ignore UCX error messages). This problem will be addressed during the preparation of the next version, along with the MPI library configuration. (Feb 10)
- On Mar 6, /apl/gamess/2022R2 and /apl/gamess/2021R1 will be replaced with Open MPI 3 version. (Feb 24)
- This version is free from UCX issue above; multi-node parallel performance is better than the others.
- Please don't try oversubscribing (e.g. ncpus=32:mpiprocs=64); it doesn't improve the performance (even when "setenv OMPI_MCA_mpi_yield_when_idle 1" is active). Although the number of computation processes is only half of the total number of processes, this version still shows better performance than the others.
- nwchem has a problem with multi-node runs (a TDDFT case is confirmed). HPC-X 2.13.1 is unrelated to this issue. (Feb 9)
- Please use /apl/nwchem/7.0.2-mpipr for NWChem 7.0.2. (Feb 13)
- In case of reactionplus, please use /apl/reactionplus/1.0/nwchem-6.8-mpipr. (Feb 13)
- Samples and module files are modified to load new ones (-mpipr).
- In the maintenance on Mar 6, they will be re-installed as /apl/nwchem/7.0.2 and /apl/reactionplus/1.0/nwchem-6.8.
- This may be due to the ARMCI_NETWORK=OPENIB setting. We have been investigating this issue. (Feb 9)
- Changing ARMCI_NETWORK from OPENIB to MPI-PR fixes the problem. (Feb 13)
- OpenMolcas is left unchanged for the time being. (Feb 13)
- Test 908 works fine with 16 MPI processes but fails with 32 MPI processes. This depends purely on the number of processes; intra- vs. inter-node placement is not related. (Feb 13)
- This may be due to the --with-openib setting of GlobalArrays. We have been investigating this issue. (Feb 10)
- The --with-mpi-pr and --with-mpi3 builds do not work at all. The --with-openib and --with-mpi (default) builds work if the number of processes is small enough. (Feb 13)
- Using Intel MPI instead of Open MPI does not help. This is probably not a problem of the MPI implementation. OpenMP works fine. (Feb 13)
- The MPI version of Siesta is sometimes very slow when Open MPI and MKL are employed. We leave it unchanged for the time being because we couldn't find a solution. (Feb 13)
- The Intel MPI version does not work properly. (Feb 13)
- Not using MKL or Intel MPI seems to improve the stability of the computation speed, but not remarkably. (Feb 13)
- Intel compiler + ScaLAPACK (i.e. without MKL) was not tested because the scalapack tests did not pass in this case. (Feb 13)
- If non-ASCII characters are present in your job submission directory path, jobinfo -w can't show the correct path. (Feb 21)
- On Feb 17-18, jobinfo output error messages due to this problem. This jobinfo error is solved now. (Feb 21)
- Some non-ASCII characters can still cause the error of jobinfo. (Apr 4)
- This problem was fixed (Apr 7)
- Installation of the following applications is completed. (May 8)
- DIRAC 22 (23 will be installed), DIRAC 23 (done), CP2K 2023.1 (done), Quantum ESPRESSO 7.1 (7.2 will be installed), Quantum ESPRESSO 7.2 (done), Julia 1.8.5 (done)
- Jobs involved in system trouble might unexpectedly go into the "Hold" state, and jobinfo reports that this is due to "(error)". (Feb 15) This bug was fixed by applying a patch to the corresponding program.
- In the previous system, this event itself was easily solved by just rerunning the job. In contrast, rerunning the job couldn't solve the problem in the new system. (Feb 15)
- (Rerunning the job is possible, but after completion it returns to the original error state.)
- If you don't want your jobs to be rerun, please add "#PBS -r n" to your job script, as in the example at the end of this item. (Please also check the reference manual.)
- In some cases, you couldn't remove those jobs with the jdel command. If you want to delete such jobs, please tell us the job IDs; we will forcibly remove them. (Feb 15)
- (You can leave them if you don't mind. They will be removed in the maintenance on Mar 6.)
- We are still investigating this issue. (Feb 15)
- The (error) reason disappeared after the trouble of the queuing system on Mar 2. However, there are no other changes in the status of the jobs; the problem still exists. (Mar 3)
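- Example placement of the directive near the top of a job script (a sketch; the resource request line is just an example):
  #!/bin/sh
  #PBS -l select=1:ncpus=16:mpiprocs=16:ompthreads=1
  #PBS -r n
  # "#PBS -r n" disables automatic rerun of this job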