Once you have access to a supercomputer, you will want to run your model code on it as soon as possible! Below, I’ve listed a few steps and issues I’ve run into when setting up my NEURON code on a new supercomputer. The first two (installing NEURON, writing a job submission script) can be avoided by using the Neuroscience Gateway portal, aka the NSG. The NSG is a great option for getting started with large-scale, parallel NEURON modeling: you can get a free account with some computing time right away, and it is well supported.
If you decide not to use the NSG, or if you like it but decide you want to scale up your project even more, then you will have additional considerations for using NEURON on another supercomputer. Here are some solutions that may be useful for you:
Install NEURON on the supercomputer
Note: if you want to use NEURON on the San Diego Supercomputer Center’s Trestles computer, this is the same computer on which the NSG runs, so you can just use their installation of NEURON.
Different computers have different configurations, and so you may need to customize the process for each one. But generally, once you have access to a supercomputer, follow these steps:
1. Check with the system administrator to see whether they prefer to install NEURON for you as a module. If they do, you simply need to load the module when you log into the supercomputer, and you can skip the rest of these steps.
2. If you will have to install NEURON yourself, first ensure that you have a compiler module (ex: gnu, intel) and an MPI module (ex: mvapich2) loaded. After logging into the supercomputer, enter the following command to see which modules are currently loaded on your account:
module list
You can also enter the following command to see what modules are available on the system:
module avail
You can then load modules using:
module load MODULE_NAME
If you have trouble compiling NEURON, you can try switching to a different compiler module before attempting to install again:
module unload gnu
module load intel
Once your environment is ready, the real fun begins. You will need to download the NEURON software onto the supercomputer, compile it, add the executables to your PATH variable, and test the installation. I’ve written up step-by-step instructions for how to do this on several different supercomputers. The instructions are available on the NEURON forum:
- Trestles supercomputer at SDSC
- Stampede supercomputer at University of Texas
- Ranger supercomputer at University of Texas (retired)
Even if the supercomputer you will use is not listed here, one of these sets of instructions may work for your machine with minor changes. If so, feel free to copy and modify the instructions and post them to the NEURON forum as a new topic about installing NEURON on that supercomputer.
Write a job submission script to submit your NEURON simulation run to the batch queue
When you run NEURON code on a supercomputer, you are sharing that computer with thousands of other users. Your simulation job will essentially need to get in a line and wait its turn to run. Supercomputers use batch queuing software (aka batch scheduling software) to manage this line. Basically, you tell the software how many processors your job wants to use, and for how long. Then the software fits everyone’s requests together like puzzle pieces to ensure that all the jobs get through the line as efficiently as possible.
The job submission script is how you tell the batch queuing software the number of processors you need and for how long. The syntax of the script will vary slightly depending on what batch queuing software the computer uses. But it will always contain these components:
- The name of your run: give your runs unique, descriptive names so you can tell them apart easily. You can give families of runs similar names (ex: ParameterSweep_01, ParameterSweep_02…)
- Where to write the log file(s): standard output and standard error can be written to specific files. You should absolutely have a good process for storing these log files, as they will be crucial in troubleshooting any failed runs you may have. Most supercomputers have shortcut symbols you can put into your job submission file so that the log files include your unique run name in their name. I highly recommend using your job run name in the log file name to make the troubleshooting process easier. Even when most of the bugs are ironed out, the log file is useful for checking how far along your job is (assuming you have written your code so that it sends occasional messages to standard output using “print” commands).
- Which queue or line your run should wait in: many supercomputers have separate lines or queues for development runs (very short runs using very few processors), very long runs (perhaps over 24 hours), runs requiring special large-memory processors, large runs requiring more processors than usual, or “normal” runs.
- How many processors the run will need: this may also be specified in terms of nodes (groups of processors). Note that I use “processor” and “core” synonymously here.
- How much time you want to reserve the processors for: specify a “hard” time requirement. The batch scheduling software will use this number to fit together its puzzle, i.e., to determine when your job will run. It is also a hard limit: if your code has not finished when the limit is reached, the supercomputer will terminate it, and you may lose all of your results! How to set a reasonable time limit and avoid losing results is covered below.
- The NEURON execution command: this is the command that will actually launch the NEURON software and your NEURON code. You may need to add in special command options here to get your code to run correctly. If you have to troubleshoot getting a correct NEURON installation to run on the supercomputer, the system administrator will likely have you edit this execution command.
The installation links given in the first section above also contain sample job submission scripts that you can use. In addition, supercomputers almost always have online user guides that fully document all the options you can specify in the job submission script and give example submission scripts.
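To make the six components above concrete, here is a minimal Python sketch that writes them out as a job script. It assumes the SLURM scheduler (the one that provides “squeue”); other schedulers use different directive names. The queue name, node counts, the “mpirun” launcher, and the file name main.hoc are all placeholders I made up for illustration, so check your supercomputer’s user guide for the real values.

```python
def render_slurm_script(job_name, queue, nodes, tasks_per_node, walltime, command):
    """Return the text of a SLURM batch script containing the components listed above."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",               # name of the run
        "#SBATCH --output=%x.%j.out",                   # stdout log: %x = job name, %j = job ID
        "#SBATCH --error=%x.%j.err",                    # stderr log
        f"#SBATCH --partition={queue}",                 # which queue to wait in
        f"#SBATCH --nodes={nodes}",                     # how many nodes
        f"#SBATCH --ntasks-per-node={tasks_per_node}",  # processors used on each node
        f"#SBATCH --time={walltime}",                   # hard time limit, HH:MM:SS
        "",
        command,                                        # the NEURON execution command
    ]
    return "\n".join(lines) + "\n"


# All values below are placeholders, not recommendations.
script = render_slurm_script(
    job_name="ParameterSweep_01",
    queue="normal",
    nodes=4,
    tasks_per_node=16,
    walltime="02:00:00",
    command="mpirun nrniv -mpi main.hoc",
)
print(script)
```

Saving the printed text to a file and passing it to “sbatch” would submit the job on a SLURM machine. Generating the script from parameters also makes it easy to give a family of runs matching names and log files.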
Read on for help determining how many processors to request and how to set your hard time limit.
Determine how many processors your simulation job needs
Different simulation configurations have different requirements, and your own priorities will dictate how many processors you want: you may want to finish runs as quickly as possible in terms of your own time, or to minimize the supercomputing time used, or to minimize the time your job waits in line. But one constraint that is not subjective is how much RAM (memory) each node (group of processors) has for your program. If your program exceeds the available memory, it will crash.
Therefore, the first step is to figure out how much memory your program needs. Here are a couple of strategies that may work:
- For my code, the memory requirement just about reaches its maximum after my model has formed all its connections (synapses); if it can successfully pass that step, it’s almost certainly going to have enough memory to execute the simulation and write the results. Since that step generally happens within a few seconds of starting the code, it requires a trivial amount of supercomputing time to run several runs that only simulate 1-2 ms of time, while varying the number of processors I use each run. I can find a number of processors that is sufficient to set up the model and all its connections without crashing. Then I can execute a much longer run that requests that number of processors.
- If you don’t want to submit multiple large runs of a few minutes each, another option is to guess your memory requirements from a smaller model version. In general, any parallel code that you write should be scalable. Not only should it be flexible enough to work on different numbers of processors, but you should also write it in such a way that you can run different model sizes very easily. If you have written your code this way, then you can execute a small run that you are certain will have enough memory to complete. You can monitor the memory usage of this run during its execution, for example by calling the top command periodically:
strdef myTopOutput
system("top -p `pgrep special | head -n20 | tr \"\\n\" \",\" | sed 's/,$//'` -n1 -b | tail -n2 | head -n1", myTopOutput)
// this will execute for each processor unless you tell it to only execute for one
// then print myTopOutput to file or standard output, along with what part of the code you are in.
// note, whether you search for a process named "special" or "nrniv" will depend on how neuron was installed.
or by using the nrn_mallinfo function (a wrapper for the Linux mallinfo function, which works on some supercomputers but not others):
// adapted from Michael Hines
strdef descriptionString
descriptionString = "where I am in the code right now"
m = nrn_mallinfo(0) // see mallinfo documentation for info about arguments to nrn_mallinfo
printf("Memory - %s: %ld\n", descriptionString, m) // this will execute for each processor unless you tell it to only execute for one
Then, you can extrapolate how much memory will be required by your full scale run. When extrapolating, keep in mind whether your memory requirements are going to scale linearly (ex: number of cells and synapses is proportional to the scale) or more steeply (ex: number of synapses is proportional to the square of the number of cells or the scale).
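The extrapolation itself is simple arithmetic. Here is a small Python sketch of it, assuming memory follows a power law in the model scale; the function name and the example numbers are hypothetical, and a real model may mix linear and quadratic terms, so treat the result as a rough guide.

```python
def extrapolate_memory(small_mem_gb, small_scale, full_scale, exponent=1.0):
    """Estimate memory at full_scale from a measurement taken at small_scale.

    exponent=1.0 assumes memory grows linearly with the scale (ex: cells and
    synapses both proportional to the scale). exponent=2.0 assumes it grows
    with the square of the scale (ex: synapses proportional to cells squared).
    """
    return small_mem_gb * (full_scale / small_scale) ** exponent


# Made-up measurement: a 1,000-cell test run peaked at 2 GB in total.
linear = extrapolate_memory(2.0, small_scale=1000, full_scale=10000)
quadratic = extrapolate_memory(2.0, small_scale=1000, full_scale=10000, exponent=2.0)
print(f"linear guess: {linear:.0f} GB, quadratic guess: {quadratic:.0f} GB")
```

The gap between the two guesses shows why it matters which scaling applies: a tenfold larger model needs ten times the memory in one case and a hundred times in the other.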
Note: some supercomputers will let you reserve a certain number of nodes (and will charge you compute time for the entire nodes) but let you specify to use only a few processors per node; this can allow you to use more memory per processor if needed.
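Here is a rough Python sketch of that trade-off: given a total memory estimate, it works out how many processors per node you can use and how many nodes to reserve. It assumes memory is spread evenly across processors, which may not hold for your model, and it ignores the memory the operating system needs, so leave some headroom. The function name and numbers are hypothetical.

```python
import math


def layout_for_memory(total_mem_gb, n_ranks, node_mem_gb, cores_per_node):
    """Return (ranks_per_node, nodes) such that each node's ranks fit in its RAM."""
    mem_per_rank = total_mem_gb / n_ranks  # assumes an even split across ranks
    ranks_per_node = min(cores_per_node, int(node_mem_gb // mem_per_rank))
    if ranks_per_node < 1:
        raise ValueError("a single rank needs more memory than one node has")
    nodes = math.ceil(n_ranks / ranks_per_node)
    return ranks_per_node, nodes


# Made-up example: 200 GB over 64 ranks, on nodes with 32 GB and 16 cores each.
ranks_per_node, nodes = layout_for_memory(200, 64, 32, 16)
print(f"use {ranks_per_node} of 16 cores per node, on {nodes} nodes")
```

In this example only 10 of the 16 cores per node can be used, so the job reserves 7 nodes instead of 4 and is charged for all of them.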
Determine how much time your job needs to run
As with determining memory requirements, I find that it’s easiest for me to execute a short simulation first to get a rough idea of the time needed. After determining the scale at which I will run my model and the number of processors I will use, I can then execute a short (10-50 ms) run and see how long that took to execute. Then I can extrapolate that time to get an estimate of how long the full simulation length will take. To be even more precise, make sure you monitor the different phases of your code (setup, creating cells, forming connections, running simulation, writing results), so you can extrapolate the times for the phases that increase with a longer simulation. Obviously the “running simulation” part will increase roughly linearly with simulation time, but the “writing results” phase will also increase, perhaps linearly as well (in both cases, we are assuming a constant model size).
Even after determining my time estimate, I always pad the hard time requirement by 25-50% the first time I run the model for a long time, and then I still pad it by about 10% even after I am confident of the time required.
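As a sketch of this estimate in Python: the function below assumes the setup phases (setup, creating cells, forming connections) take the same time regardless of simulation length, and that the simulation and result-writing phases both scale linearly with it. The function names and the timings are made up for illustration.

```python
import math


def estimate_walltime(setup_s, sim_s, write_s, short_ms, full_ms, padding=0.5):
    """Extrapolate wall-clock seconds for a full run from a timed short run.

    setup_s, sim_s and write_s are the phase times measured in the short run,
    which simulated short_ms of model time. padding=0.5 adds 50 percent.
    """
    factor = full_ms / short_ms
    estimate = setup_s + (sim_s + write_s) * factor
    return estimate * (1.0 + padding)


def as_hms(seconds):
    """Format seconds as HH:MM:SS, rounding up to the next whole second."""
    total = math.ceil(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Made-up timings from a 50 ms test run, extrapolated to a 2000 ms simulation.
first_long_run = estimate_walltime(120, 30, 5, short_ms=50, full_ms=2000, padding=0.5)
print(as_hms(first_long_run))
```

Once a long run has confirmed the estimate, calling the same function with padding=0.1 gives the tighter request.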
Another safety net I recommend is to write your NEURON code so that it is sensitive to how much time it has taken, and how much is left. This means, whatever time you set for your hard time limit in the job submission script, you should pass that amount of time into your NEURON code as a parameter. Use “startsw()” in your NEURON code to track how much time it is taking. Then, during the simulation phase, execute your simulation in chunks of time (say 50-100 ms each). Time how long it takes for each chunk to complete. Then after each chunk, check how much time you have remaining and calculate whether there is enough time to complete another chunk of simulation and write out all the results. If there is not, stop the simulation immediately and write out the results right then. Alternatively, you could write out the results from each chunk of the simulation immediately after it completes. Either of these strategies will prevent your code from being suddenly terminated at the end of the hard time limit and losing all your results (and the supercomputing time spent) due to setting a time limit that was too short.
Here is a sample code file for executing the simulation in chunks and checking after each chunk whether you have time to complete another one.
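Independent of that sample file, the chunking logic can be sketched in a few lines of Python. This is only an outline under assumptions: advance and write_results are hypothetical callbacks standing in for your own model code, limit_s is the wall-clock time still available when the loop starts (the hard limit minus whatever setup already used), and write_reserve_s is your own estimate of how long writing results takes.

```python
import time


def run_in_chunks(advance, write_results, tstop_ms, chunk_ms, limit_s,
                  write_reserve_s, clock=time.monotonic):
    """Advance the simulation chunk by chunk, stopping early if time runs short.

    Returns the simulated time (ms) that was reached before results were written.
    """
    start = clock()
    t = 0.0
    slowest_chunk_s = 0.0
    while t < tstop_ms:
        remaining = limit_s - (clock() - start)
        # Stop unless there is room for one more chunk plus writing the results.
        if remaining < slowest_chunk_s + write_reserve_s:
            break
        chunk_start = clock()
        t = min(t + chunk_ms, tstop_ms)
        advance(t)  # hypothetical: run the model up to time t
        slowest_chunk_s = max(slowest_chunk_s, clock() - chunk_start)
    write_results()  # hypothetical: save whatever has been simulated so far
    return t
```

With NEURON’s Python interface, advance could wrap a call such as pc.psolve(t) on a ParallelContext. In a parallel run, every processor must reach the same decision about stopping, for example by deciding on one processor and broadcasting the result; this sketch leaves that part out.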
Check the status of your submitted jobs
Finally, you will want to (and should) monitor your jobs after you submit them to the supercomputer. There are a variety of ways to do this:
- Set your email notification options in the job submission script: you can tell the job whether to email you at each milestone (job begins running, job finishes running, job fails or is terminated, etc.)
- Use the status commands associated with the batch queuing software: log into the supercomputer and run a command to see whether your job has begun (commands such as “qstat”, “squeue”, and “showq” work depending on which batch queuing software you are working with). The job status will be reported, usually pending or running (though completed and failed jobs briefly remain in the queue and will report their status back if you send a command during that time).
- Check the log file(s): you should be writing the standard output and standard error generated from your job to file (either separate files or merged into one). After your job begins executing, check the log file periodically using the “cat” or “tail” commands to check that the run is still executing successfully (i.e., that it isn’t hanging and hasn’t errored without quitting) and is writing out any print statements you added to your code.
- Look for interim results: if your code is supposed to write out result files early on in the simulation, check to see if they have been generated yet.
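If you would rather not check log files by hand, the log check above can be partly automated. The Python sketch below reports how long ago a log file was last written to and shows its last few lines. The function names and the 15-minute threshold are my own assumptions, and the check only means something if your code prints progress messages regularly.

```python
import os
import time


def seconds_since_update(path, now=None):
    """Seconds since the file at path was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def last_lines(path, n=5):
    """Return the last n lines of a text log, much like the tail command."""
    with open(path, "r", errors="replace") as handle:
        return handle.readlines()[-n:]


def looks_stalled(path, max_quiet_s=900):
    """True if the log has not been written to for more than max_quiet_s seconds."""
    return seconds_since_update(path) > max_quiet_s
```

For example, calling looks_stalled on the standard output file of a running job every few minutes would flag a run that has stopped printing, at which point last_lines shows where it got to.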
Checking the status of your jobs is extremely important at the beginning of your project and after you make updates to the code, when you will want to ensure that your jobs are not failing or otherwise stalling and eating up compute time (and your time) without producing results.