FAQ
- What will happen if the resource usage exceeds the limit
- How are CPU points for a job calculated?
- Amount of memory available for a job
- How to add/remove members?
- How to build/debug software in RCCS?
- How to know limitation on CPU/GPU usage?
- Cannot log in via SSH / Logging in from outside Japan during a business trip
- We need extra CPU points / disk space
- How to change login shell?
- My jobs have been waiting for a very long time
- "Cgroup mem limit exceeded" message
- "No space left on device" error
- Got an email titled "[RCCS] 障害発生と影響を受けたジョブに関するお知らせ" (notice about a failure and affected jobs)
- Can't connect via sftp (including WinSCP etc.) with "Received message too long" error
- g16 immediately crashed with l1.exe: Permission denied
- Cannot use formchk (Gaussian checkpoint file converter)?
- formchk failed with insufficient memory error
- Python environment (version, libraries) construction
- Intel Compilers (ifort, icc, icpc) not available
- Are there job script header samples?
- Request installation of software
- SSH Connection Error Messages
What will happen if the resource usage exceeds the limit
If your group's CPU points run out, running jobs of that group will be deleted 24 or more hours after the points are exhausted.
It takes up to 10 minutes for the system to update the usage status (i.e., the output of the showlim command). For example, if you exceed your disk usage limit and then reduce the usage by removing files, you still need to wait up to 10 minutes before submitting new jobs.
How are CPU points for a job calculated?
Amount of memory available for a job
When you request 64 CPU cores (ncpus=64), you can use ~512 GB of memory for largemem jobs and ~128 GB for other jobs. Since the available memory is proportional to the number of cores, you still need to request many CPU cores even when you need a large amount of memory but only a few cores.
How to add/remove members?
How to build/debug software in RCCS?
However, some performance limitations are intentionally imposed on the frontend nodes.
Please don't run production jobs or long benchmark runs on the frontend nodes.
On ccfep*, CPU points are consumed according to the actual CPU usage. (If you just log in and do nothing, virtually no CPU points are consumed.)
Interactive jobs are useful for testing or building GPU programs, which can't be done on the login servers. However, interactive jobs are currently not available to all users; if you want to use them, please ask us. Please note that interactive jobs have some additional restrictions (e.g., walltime and the number of CPUs/GPUs).
How to know limitation on CPU/GPU usage?
The CPU/GPU limits are basically determined by the initially assigned points. The actual values can be found in this table (right columns).
The current limits are shown by the "jobinfo -s" command. (See the "User/Group Stat" part at the beginning of the output.)
Here is an example. The "****" parts in the "RunLim" column show the maximum numbers of CPUs/GPUs available to you ("user" column) and to your group ("group" column).
[user@ccfep* ~]$ jobinfo -s
User/Group Stat:
--------------------------------------------------------------------------------
queue: H | user | group
--------------------------------------------------------------------------------
NJob (Run/Queue/Hold/RunLim) | x/ x/ x/- | x/ x/ x/****
CPUs (Run/Queue/Hold/RunLim) | x/ x/ x/**** | x/ x/ x/****
GPUs (Run/Queue/Hold/RunLim) | x/ x/ x/**** | x/ x/ x/****
core (Run/Queue/Hold/RunLim) | x/ x/ x/**** | x/ x/ x/****
lmem (Run/Queue/Hold/RunLim) | x/ x/ x/- | x/ x/ x/****
--------------------------------------------------------------------------------
note: "core" limit is for per-core assignment jobs (jobtype=core/gpu*)
note: "lmem" limit is for jobtype=largemem
Queue Status (H):
(skipped)
There are two types of limits: per-user and per-group.
Per-user limits can be set by the representative of the group on the resource limit page.
Cannot log in via SSH / Logging in from outside Japan during a business trip
If you want to log in from outside Japan, you need to submit the application form available on this page.
Applicants are basically required to have a valid academic affiliation in Japan and an RCCS account; logins by overseas collaborators are generally not allowed.
We need extra CPU points / disk space
Please be careful about limitations.
How to change login shell?
It may take some time until the new setting becomes active. If the change has not been applied after more than a day, please contact us.
My jobs have been waiting for very long time
Major reasons why your jobs won't run are listed below. If your case seems exceptional or unexplained, feel free to contact us.
Not enough resources (CPU, GPU) available
You can check the status of available resources on this website. Once you log in to the site and go to the top page, you will find the status in the right column. There may still be enough room for other types of jobs, so switching the jobtype (if possible) might be a good idea in this case.
My job won't run even though enough computational resources are available.
First, run the jobinfo command to check why your jobs won't run. You can find the reason in the rightmost column of the output. Major reasons are:
- (cpu), (gpu): not enough cpu/gpu available
- (group): due to the group limit
- (long): walltime too long (once a job began to run, that must be finished before the next maintenance)
- (other): the scheduler has not yet tried to run your job (or another unknown reason)
- in some cases of (other), "jobinfo -c" can't show the correct reason due to insufficient information from the queuing system; please try without "-c"
In case of (group), you or other members of your group are using a large amount of resources. Please ask the corresponding person to reduce the usage. FYI, the representative can limit the resource usage of each member on the resource limit page.
In case of (long), on the other hand, you may need to shorten the walltime of your job or wait until the end of the next maintenance. (NOTE: queued jobs won't be deleted by the monthly maintenance.)
My jobtype=core jobs won't run even though enough computational resources are available.
Multinode jobs are not allowed for jobtype=core; such a job must fit on a single node. Even when there are many free CPU cores for jobtype=core in total, those cores are often scattered across nodes, and no single node may have enough room for a large jobtype=core job. You might want to use 64 cores with jobtype=vnode in such cases.
"Cgroup mem limit exceeded" message
If the amount of memory used exceeds the limit, the job exits with an error message like the one below.
Cgroup mem limit exceeded: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=*******.ccpbs1,mems_allowed=*,oom_memcg=/pbs_jobs.service/jobid/*******.ccpbs1,task_memcg=/pbs_jobs.service/jobid/*******.ccpbs1,task=molpro.exe,pid=******,uid=******
[Wed Feb 1 **:**:** 2023] memory: usage ********kB, limit *********kB, failcnt ********
[Wed Feb 1 **:**:** 2023] memory+swap: usage *********kB, limit *****************kB, failcnt 0
[Wed Feb 1 **:**:** 2023] kmem: usage ******kB, limit *****************kB, failcnt 0
[Wed Feb 1 **:**:** 2023] Memory cgroup stats for /pbs_jobs.service/jobid/*******.ccpbs1:
You need to reduce the amount of memory used, or use jobtype=largemem, where a larger amount of memory is available. For jobtype=core, please note that the amount of available memory is proportional to the "ncpus" value in the job header. In some cases you may need to specify ncpus=8 in the header even though you actually use only 4 CPU cores (e.g., mpirun -np 4).
If memory is over-allocated but not actually used, the job may run normally even though "Cgroup mem limit exceeded" appears in the error file. Either way, you should reduce the memory amount in your input file to avoid potential errors.
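As an illustration of the ncpus=8 / 4-core case above, a job header might look like the following. This is only a sketch: the PBS-style select line and the program name are assumptions, so please take the exact syntax from the RCCS manual.

```shell
#!/bin/sh
# request 8 cores on one node to obtain the memory of 8 cores
# (select-line syntax is an assumption; check the manual)
#PBS -l select=1:ncpus=8:mpiprocs=4:jobtype=core
cd "${PBS_O_WORKDIR}"
# ...but actually run on only 4 cores (program name is a placeholder)
mpirun -np 4 ./program.exe
```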
"No space left on device" error
You may have exceeded the size limit of the local scratch space for the job (/lwork/users/(username)/(job id)). The limit is 11.9 GB * ncpus, where ncpus is the number of cores assigned on that node (not the total ncpus of the job). You can therefore avoid this error by increasing the "ncpus" value of your job, or by using the global scratch space (/gwork/users/(username)) instead of the local scratch. Please note that /gwork is huge but slower than /lwork.
g16sub and g09sub use /lwork/users/(username)/(jobid) by default. You can switch the scratch space to /gwork by adding the "-N" option.
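As a quick sanity check of the local scratch formula above (11.9 GB per core), you can compute the limit for a given core count, e.g. 16 cores on the node:

```shell
# local scratch limit = 11.9 GB * ncpus (cores available on that node)
awk -v ncpus=16 'BEGIN { printf "%.1f GB\n", ncpus * 11.9 }'   # prints "190.4 GB"
```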
Got an email titled "[RCCS] 障害発生と影響を受けたジョブに関するお知らせ" (notice about a failure and affected jobs)
If a computation node, network switch, or storage device fails, this email is sent to the owners of the jobs affected by the trouble. (An English version of this email is not yet available, sorry.)
(uid)様
2020年**月**日**時**分頃、****において
障害(****************)が発生しました。影響を受けたジョブの一覧をお知らせします。
(Translation: Dear (uid), around **:** on **/**, 2020, a failure (****************) occurred at ****. The affected jobs are listed below.)
--------+--------+--------+--------------+-----+------------------
Job ID Jobtype User Job Name #Core Treatment
--------+--------+--------+--------------+-----+------------------
(jobid) ***** (uid) (jobname) (num) XXX
--------+--------+--------+--------------+-----+------------------
In this email, check the action taken for the affected job (the "Treatment" column in the sample above).
- 異常終了 (Abort) => The job was aborted (including the case where a job failed to rerun). CPU points for that job will be refunded later.
- リラン (Rerun) => The job was rerun from the beginning. CPU points spent in the aborted run will be ignored (not charged).
Can't connect via sftp (including WinSCP etc.) with "Received message too long" error
This error occurs when your shell startup file (e.g., .bashrc) prints something during non-interactive logins, which corrupts the sftp protocol stream. Possible fixes are:
- discard the output to /dev/null
- print the output only if $PS1 is set (i.e., the shell is interactive)
- move the corresponding lines to .bash_profile
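A minimal sketch of the $PS1 approach for ~/.bashrc (the echo line stands in for whatever your startup file currently prints):

```shell
# ~/.bashrc: print messages only in interactive shells.
# sftp/scp run non-interactive shells, where $PS1 is unset,
# so nothing is written to the protocol stream.
if [ -n "$PS1" ]; then
    echo "Welcome to the login server"
fi
```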
g16 immediately crashed with l1.exe: Permission denied
If the settings of multiple revisions of g16 are loaded, a g16 job may die immediately.
Output example:
Entering Gaussian System, Link 0=/apl/gaussian/***/g16/._.g16
Initial command:
/apl/gaussian/****/g16/l1.exe "/gwork/users/***/********.ccpbs1/Gau-66921.inp" -scrdir="/gwork/users/***/********.ccpbs1/"
sh: /apl/gaussian/***/g16/l1.exe: Permission denied
Typically, this happens when configuration files are loaded both in a default setting file (such as .cshrc or .bashrc) and in the job script. You may need to modify either the default setting file or the job script to solve this problem.
On the other hand, a single setting file can safely be loaded multiple times.
Also, you might (though not always) be able to run Gaussian successfully after loading setting files of multiple Gaussian versions (g09 and g16). It is not recommended, though.
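For example, the following pattern, split between ~/.bashrc and the job script, can trigger the error. The 16b01 revision name here is a placeholder; only 16c02 appears elsewhere in this FAQ.

```shell
# in ~/.bashrc (loaded automatically when the job starts):
source /apl/gaussian/16b01/g16/bsd/g16.profile   # placeholder revision

# in the job script (a second, different revision is loaded):
source /apl/gaussian/16c02/g16/bsd/g16.profile   # may lead to "l1.exe: Permission denied"
```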
Cannot use formchk (Gaussian checkpoint file converter)?
csh (tcsh):
$ source /apl/gaussian/16c02/g16/bsd/g16.login
bash or zsh:
$ source /apl/gaussian/16c02/g16/bsd/g16.profile
(You can ignore the "PYTHONPATH: Undefined variable." message; the settings are loaded correctly and formchk is available in this case.)
If you are using a different version of Gaussian or a different queue, please replace the directory names above with the corresponding ones.
If you load one of gaussian modules (environment modules), corresponding version of formchk will be available.
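Once the settings are loaded, a typical conversion looks like the following (the filenames are placeholders):

```shell
# convert a binary checkpoint file into a text-format .fchk file
formchk water.chk water.fchk
```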
formchk failed with insufficient memory error
Out-of-memory error in routine WrCIDn-* (IEnd= ******** MxCore= ***********)
Use %mem=***MW to provide the minimum amount of memory required to complete this step.
Error termination via Lnk1e at (***date***).
The amount of memory available to formchk can be changed via the GAUSS_MEMDEF environment variable. Set a sufficiently large value according to the error message above, and then run formchk again.
- csh:
setenv GAUSS_MEMDEF 800000000
- bash/zsh:
export GAUSS_MEMDEF=800000000
In this example, 800 MW (= 6.4 GB) of memory is specified.
Python environment (version, libraries) construction
You can install additional libraries into your home directory with pip, for example:
$ pip install numpy --user
You can also use pyenv and/or miniforge. Anaconda is also a very good option if its license poses no problem for you. If you are planning to use GPUs, we recommend using a conda environment such as miniforge; the runtime libraries for GPUs (cudatoolkit) can be installed easily with conda.
A miniforge environment prepared by RCCS is also available. For usage, please check the links in the Python section of the package program list page.
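As a sketch of another common setup, a per-project virtual environment keeps libraries isolated from your --user site-packages (the directory name is arbitrary):

```shell
# create an isolated Python environment and install packages into it
python3 -m venv ~/venvs/myproject     # directory name is a placeholder
source ~/venvs/myproject/bin/activate
pip install numpy                     # installs into the venv only
```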
Intel Compilers (ifort, icc, icpc) not available
If you need them, please install Intel oneAPI Base Toolkit and HPC Toolkit into your directory.
https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html (Base Toolkit)
https://www.intel.com/content/www/us/en/developer/tools/oneapi/hpc-toolkit.html (HPC Toolkit)
(These toolkits can be installed free of charge. However, there are restrictions on redistribution.)
Are there job script header samples?
You can find some explanations in our manual. The sample scripts for each application (/apl/(name)/(version)/samples) may also be worth checking.
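As an illustration only (please take the authoritative syntax from the manual), a PBS-style header for a 16-core single-node job might look like this; the select line, walltime value, and program name are all assumptions:

```shell
#!/bin/sh
#PBS -l select=1:ncpus=16:mpiprocs=16:jobtype=core   # resource request (syntax assumed)
#PBS -l walltime=24:00:00                            # wall-clock limit (value assumed)
cd "${PBS_O_WORKDIR}"       # move to the directory the job was submitted from
mpirun -np 16 ./a.out       # program name is a placeholder
```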
Request installation of software
Please fill in the following items and send them to rccs-admin[at]ims.ac.jp (please replace [at] with @).
- Software name and version that you want to use
- Overview of the software and its features
- Why the software is necessary on the RCCS supercomputer
- URL of the official website
SSH Connection Error Messages
Permission denied (publickey)
Permission denied (publickey,hostbased).
The private and public keys don't match, or the public key is not registered. This message may also be shown if the private key is not correctly prepared or specified.
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,hostbased).
This message will be shown if the private key is not correctly specified.
ssh:connect:Network is unreachable
ssh:connect:Host is unreachable
These messages are shown if the network settings are incorrect or the connection route to RCCS is unavailable due to network errors.
ssh:connect:Connection refused
ssh:connect:Connection timed out
The RCCS service is temporarily stopped for maintenance or due to trouble, or the connection is refused according to the rules described below.
ssh_exchange_identification: Connection closed by remote host
Server unexpectedly closed network connection
A network failure occurred, or the connection was refused according to the rules described below.
If the connection can't be established, adding a verbose option to your SSH software will give you some useful information/messages. For example, if you are using the "ssh" command in a terminal (e.g., on Mac or Linux), verbose messages can be enabled by adding the -v option.