FAQs

Frequently asked questions about usage of computers.

What will happen if the resource usage exceeds the limit?

New jobs will not be accepted. Please keep an eye not only on CPU usage but also on disk usage.
If the group's CPU points run out, the jobs of that group will be removed once the points have been exceeded for 24 hours or more.

How to build/debug software in RCCS?

We don't have an interactive or debug queue.
You can build and test software on the frontend nodes (ccfep*, ccgpup, ccgpuv), where parallel make (e.g. make -j 12) and MPI program runs (e.g. mpirun -np 12 ...) are allowed.

However, performance on those frontend nodes is intentionally limited.
Please don't try production runs or long benchmark jobs on the frontend nodes.
CPU points will be used on ccfep* and ccgpuv according to the actual CPU usage. (If you just log in and do nothing, virtually no CPU points will be consumed.)

To log in to the GPU-equipped frontend nodes (ccgpup or ccgpuv), first log in to ccfep* and then "ssh" to ccgpup/ccgpuv.
No additional settings are required if you can log in to ccfep*.

How to know limitation on CPU/GPU usage?

The CPU/GPU limits are basically determined by the initially assigned points. The actual values can be found in this table (right columns).

The current limits can be shown with the "jobinfo -q PN -c" command (do not add the -l option in this case).
Here is an example. The "****" entries in the "RunLim" row (colored blue and red on the web page) show the maximum number of CPUs/GPUs available to you (User column) and to your group (Group column), respectively.

[user@ccfep* ~]$ jobinfo -q PN -c
-----+------+---------------------------+-------------------------
Queue|Job   |User(***)     Group(***)   |Total                    
Name |Status|CPUs,GPU/Jobs CPUs,GPU/Jobs|  CPUs,GPUs/Jobs(Usr,Grp)
-----+------+---------------------------+-------------------------
PN   |RunLim|****, **/-    ****, **/-   | xxxxx, xxx/-   (  -,  -)
     |Run   |   x,  x/x       x,  x/x   | xxxxx, xxx/xxxx(xxx,xxx)
     |Queue |   x,  x/x       x,  x/x   | xxxxx, xxx/xxxx(xxx,xxx)
     |Hold  |   x,  x/x       x,  x/x   | xxxxx, xxx/xxxx(xxx,xxx)
     |Exit  |   x,  x/x       x,  x/x   |     x,   x/x   (  x,  x)
     |Total |   x,  x/x       x,  x/x   |xxxxxx,xxxx/xxxx(xxx,xxx)
-----+------+---------------------------+-------------------------

There are two types of limits: user limits and group limits.
Per-user limits can be set by the representative of the group on the resource limit page.

Cannot login via SSH / Login from outside Japan during Business Trip

If you have trouble accessing RCCS from within Japan, please check this page (login required).

If you want to log in from outside Japan, you need to submit an application form available on this page.
Applicants are basically required to have a valid academic affiliation in Japan and an RCCS account; overseas collaborators are generally not allowed to log in to RCCS.
 

How to change login shell?

You can use csh (tcsh), bash, or zsh as your login shell. This setting can be changed only via the center web page. First, click the "My account" item in the left pane. (That item is not shown unless you are logged in.) Then click the "Edit" tab on that page and choose your login shell from the combo box. After choosing the login shell, be sure to click the "Save" button below to save the new setting.

It might take some time until the new setting becomes active. If the change has not been applied after more than a day, please contact us.
 

My jobs are waiting for a very long time

The major reasons why your jobs won't run are listed below. If you encounter an exceptional or unknown case, feel free to contact us.

Not enough resources (CPU, GPU) available

You can check the status of available resources on this website. After you log in to this site and go to the top page, you will find the status in the right column of the page. There may be enough room for other types of jobs, so switching the jobtype (if possible) might be a good idea in this case.
 

My job won't run even though enough computational resources are available.

First, run the jobinfo command with the "-l" option to check why your jobs won't run. You can find the reason in the rightmost column of the output. The major reasons are:

  • (cpu), (gpu): not enough CPUs/GPUs available
  • (group): due to the group limit
  • (long): walltime too long (once a job starts running, it must finish before the next maintenance)

In the case of (group), you or other members of your group are using a large amount of resources. Please ask the person concerned to take action. FYI, the group representative can limit the resource usage of each member on the resource limit page.

On the other hand, in the case of (long), you may need to shorten the walltime of your job or wait until the end of the next maintenance. (NOTE: queued jobs won't be deleted upon monthly maintenance.)

My jobtype=core jobs won't run even though enough computational resources are available.

Multinode jobs are not allowed for jobtype=core; each job must fit on a single node. Even when many CPU cores for jobtype=core are free in total, those free cores are often scattered across nodes, so there may not be enough room on any single node for a large jobtype=core job. In some cases you might want to use 40 cores with jobtype=small instead.

Got an email titled "[RCCS] 障害発生と影響を受けたジョブに関するお知らせ" ("Notice of a failure and the affected jobs")

If a compute node, network switch, or storage system fails, this email is sent to the owners of the jobs affected by the failure. (English version of this email is not yet available, sorry.)

(uid)様

2020年**月**日**時**分頃、****において
障害(****************)が発生しました。

影響を受けたジョブの一覧をお知らせします。
--------+------+--------+--------------+-----+------------------
Job ID   Queue  User     Job Name       #Core Treatment
--------+------+--------+--------------+-----+------------------
 (jobid) PN     (uid)    (jobname)       (num) Exit_status=XXX
--------+------+--------+--------------+-----+------------------

(English translation of the message above: "Dear (uid), at around **:** on **/**, 2020, a failure (****************) occurred in ****. The affected jobs are listed below.")

The email shows the exit status of each job (the "Exit_status=XXX" value, colored red on the web page). If the value is negative, the affected job is rerun automatically. Although the exit status is expected to be negative after a system failure, a non-negative value is sometimes returned, because the job system occasionally cannot recognize the state of the failed system correctly. In that case the job simply ends without being rerun, and you need to submit it again. CPU points consumed by the jobs up to the system failure are fully refunded regardless of the exit status.

With an automatic rerun (negative exit status), some jobs fail on the second run because of intermediate files left by the previous run. You can avoid this kind of failure by removing the intermediate files of previous runs before the main program starts. Alternatively, you can add "#PBS -r n" to the job script to suppress the automatic rerun of the job.
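
As a rough sketch of the cleanup step mentioned above, a job script could delete leftovers of a previous run before starting the main program. The function name clean_intermediates, the '*.tmp' pattern, and my_program are hypothetical placeholders; which files are safe to delete depends on your application.

```shell
# Remove leftover files from a previous (interrupted) run.
# Usage: clean_intermediates DIR PATTERN...
clean_intermediates() {
    dir=$1
    shift
    for pattern in "$@"; do
        # $pattern is left unquoted on purpose so that the shell expands
        # the wildcard inside $dir; rm -f ignores patterns with no match.
        rm -f -- "$dir"/$pattern
    done
}

# Example use near the top of a job script (hypothetical names):
#   clean_intermediates "$PBS_O_WORKDIR" '*.tmp'
#   ./my_program
```

Pass the patterns in single quotes so that they are expanded inside the target directory and not in the directory the script happens to be running in.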

g16 immediately crashed with l1.exe: Permission denied

If the settings of multiple g16 revisions are loaded, a g16 job might die immediately.
Output example:

 Entering Gaussian System, Link 0=/local/apl/lx/g16????/g16/._.g16
 Initial command:
 /local/apl/lx/g16****/g16/l1.exe "/work/users/***/********.cccms1/Gau-66921.inp" -scrdir="/work/users/***/********.cccms1/"
sh: /local/apl/lx/g16***/g16/l1.exe: Permission denied

Typically, this happens when setting files are loaded both in the default setting files (such as .cshrc or .bashrc) and in the job script. You may need to modify either the default setting file or the job script to solve this problem.

On the other hand, a single setting file can safely be loaded multiple times.
Also, you might (not always) be able to run Gaussian successfully after loading the setting files of multiple Gaussian versions (g09 and g16). This is not recommended, though.
 

Gaussian Freq calculations constantly crash due to a memory shortage

There are some reports that Gaussian "Freq" calculations crash due to a memory shortage even after the available memory size is increased. According to these reports, *reducing* the available memory size can solve the problem in such cases. If you have trouble with this issue, this workaround may be worth trying.
 

Cannot use formchk (Gaussian checkpoint file converter)?

You need to load the setting file before running formchk. If you want to convert a checkpoint file produced by g16 in the LX queue, run the following command. Please note that the setting file name depends on your login shell.

csh (tcsh):

$ source /local/apl/lx/g16/g16/bsd/g16.login

bash or zsh:

$ source /local/apl/lx/g16/g16/bsd/g16.profile

If you are using a different version of Gaussian or a different queue, please replace the directory names above with the corresponding ones.
If you load one of the Gaussian modules (environment modules), the corresponding version of formchk will be available.
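
Once the setting file is loaded, formchk takes the checkpoint file name followed by the name of the formatted file to write. The following sketch converts every .chk file in a directory; the function name convert_chk_files and the file name in the comment are hypothetical, and it assumes formchk is already on your PATH.

```shell
# Convert every checkpoint file in a directory to a formatted one.
# Assumes the Gaussian setting file has already been loaded,
# so that "formchk" can be found.
convert_chk_files() {
    for chk in "$1"/*.chk; do
        [ -e "$chk" ] || continue          # no .chk files: nothing to do
        formchk "$chk" "${chk%.chk}.fchk"  # e.g. water.chk -> water.fchk
    done
}

# Usage: convert_chk_files /path/to/job/directory
```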
 

Python environment (version, libraries) construction

If the requested version of Python or the library is provided by the distribution, we can install it. However, we recommend building an Anaconda environment in your home directory, because Anaconda covers a wider range of the Python applications/libraries used in academic science than the package set provided by the distribution.

We have also prepared some Anaconda environments. Links to their usage pages can be found on the package program list page.
 

Modifications to files ignored? / Not removable, not controllable file?

Modifications to files ignored?

In some rare cases after system trouble, modifications to files may appear to be ignored.
This is generally caused by a mishandled file cache.
If you run into this, please check the file carefully according to the following procedure.

  • log in to other hosts (ccfepX; X=1-8)
  • get the checksum of the file (e.g. md5sum) or look at the file directly (e.g. less)
  • if the checksums or the contents of the file differ between hosts, there may be a cache problem.
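
The comparison in the last step can be done by eye, but a small helper makes it harder to miss a difference. This is only a sketch: checksums_differ is a hypothetical function name, and the usage lines assume that you can ssh from one frontend node to the others, which this page does not state.

```shell
# Succeeds (exit status 0) if the given checksums are NOT all identical.
checksums_differ() {
    first=$1
    for sum in "$@"; do
        [ "$sum" = "$first" ] || return 0
    done
    return 1
}

# Usage sketch (hypothetical file path; host names as listed above):
#   file=/path/to/suspicious/file
#   sums=$(for n in 1 2 3; do ssh "ccfep$n" md5sum "$file" | cut -d' ' -f1; done)
#   checksums_differ $sums && echo "possible cache problem"
```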

If you find a suspicious file, please report the hostname and the file location to us: rccs-admin[at]ims.ac.jp (please replace [at] by @).
We will clear the file cache of that host. (Normal users cannot perform this operation.)

You can also fix it yourself by manipulating the file (not recommended, though).
If you try this, please make a backup before manipulating the file.
 

Not removable, not controllable file?

On the other hand, files can be broken by storage trouble. In some cases, "ls -l" returns the following type of output. (This can happen while a file is being created. Files that already exist should not be affected by this type of trouble.)

ls: cannot access (file name): No such file or directory
-????????? ? ?      ?        ?       ?            (file name)

Unfortunately, this file is completely broken. All you can do with it is remove it.
You may need to use the "unlink" command instead of the standard "rm" command to remove this file.
 

Are there job script header samples?

We have prepared some samples on this page.
You can find some explanations in our manual. The sample scripts for each application (/local/apl/lx/(name)/samples) may also be worth checking.

Request installation of software

Please fill in the following items and send them to rccs-admin[at]ims.ac.jp (please replace [at] by @).

  • Software name and version that you want to use
  • Overview of the software and its feature
  • Necessity of installation to supercomputers in RCCS
  • URL of the software development

SSH Connection Error Messages

Permission denied (publickey)

Permission denied (publickey,hostbased).

Private and public keys don't match or the public key is not registered. This message may be shown if the private key is not correctly prepared or specified.

Permission denied (publickey,gssapi-keyex,gssapi-with-mic,hostbased).

This message will be shown if the private key is not correctly specified.

ssh:connect:Network is unreachable

ssh:connect:Host is unreachable

These messages are shown if the network settings are incorrect or the route to RCCS is unavailable because of a network error.

ssh:connect:Connection refused

ssh:connect:Connection timed out

The RCCS service is temporarily stopped for maintenance or because of trouble. Alternatively, the connection was refused according to the rules described below.

ssh_exchange_identification: Connection closed by remote host
Server unexpectedly closed network connection

A network failure has occurred, or the connection was refused according to the rules described below.

If the connection cannot be established, enabling the verbose option of your ssh software will give you useful information. For example, if you use the "ssh" command in a terminal (typically on Mac and Linux), verbose messages can be enabled by adding the -v option.
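
For example, the debug messages can be saved to a file so that they are easier to read or to quote in an inquiry. This is a sketch only: ssh_verbose and ssh_debug.log are hypothetical names, and the user and host in the usage line are placeholders.

```shell
# Run ssh with -v and save the debug messages (written to stderr) to a file.
ssh_verbose() {
    ssh -v "$@" 2> ssh_debug.log
}

# Usage (placeholders): ssh_verbose (user)@(login host)
# Afterwards, look at ssh_debug.log for lines about keys and authentication.
```

Because error messages also go to stderr, they end up in ssh_debug.log and are not shown on the screen.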