# Nodes and Memory
## Overview
Before submitting your job to the scheduler, it is important to know how many cores and how much memory your task requires, since these resources are assigned based on the partition and the number of nodes you request.
## Partition (Use `--partition`)
Wulver has four partitions, which differ in core count, GPUs, and memory available:
| Partition | Nodes | Cores/Node | CPU | GPU | Memory |
|---|---|---|---|---|---|
| `--partition=general` | 100 | 128 | 2.5 GHz AMD EPYC 7763 (2) | NA | 512 GB |
| `--partition=debug` | 1 | 4 | 2.5 GHz AMD EPYC 7763 (2) | NA | 512 GB |
| `--partition=gpu` | 25 | 128 | 2.0 GHz AMD EPYC 7713 (2) | NVIDIA A100 GPUs (4) | 512 GB |
| `--partition=bigmem` | 2 | 128 | 2.5 GHz AMD EPYC 7763 (2) | NA | 2 TB |
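As a minimal sketch, the partition is selected with an `#SBATCH --partition` directive in your job script. The job name and executable below are placeholders, and a real script will also need a QoS (and possibly an account), covered in the next section:

```bash
#!/bin/bash
#SBATCH --job-name=gpu_test        # placeholder job name
#SBATCH --partition=gpu            # one of: general, debug, gpu, bigmem
#SBATCH --gres=gpu:1               # request 1 of the 4 A100 GPUs on a gpu node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G                  # well under the 512 GB available per node
#SBATCH --time=02:00:00            # HH:MM:SS, must stay within the QoS wall time limit

./my_gpu_program                   # placeholder executable
```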
## Priority (Use `--qos`)
Wulver implements job “priority” in Slurm as Quality of Service (QoS); the available QoS levels are:
| QoS | Purpose | Rules | Wall time limit (hours) | Valid Users |
|---|---|---|---|---|
| `--qos=standard` | Normal jobs. Faculty PIs are allocated 300,000 Service Units (SU) per year | SU charges based on node type (see SU); jobs can be preempted by enqueued high-QoS jobs | 72 | Everyone |
| `--qos=low` | Free access, no SU charge | Jobs can be preempted by enqueued high- or standard-QoS jobs | 72 | Everyone |
| `--qos=high_$PI` | Replace `$PI` with the UCID of the PI; only available to owners/investors | Highest-priority jobs, no SU charge | 72 | Owner/investor PI groups |
| `--qos=debug` | Intended for debugging and testing jobs | No SU charge; a maximum of 4 CPUs is allowed; must be used with `--partition=debug` | 8 | Everyone |
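For instance, a short test run under the `debug` QoS must also use the `debug` partition and stay within 4 CPUs. A minimal sketch (the job name and executable are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=debug_test      # placeholder job name
#SBATCH --partition=debug          # debug QoS must be paired with the debug partition
#SBATCH --qos=debug                # no SU charge, 8-hour wall time limit
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4          # debug QoS allows at most 4 CPUs
#SBATCH --time=00:30:00

./my_test_program                  # placeholder executable
```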
## How many cores and memory do I need?
There is no deterministic method of finding the exact amount of memory a job needs in advance. A general rule of thumb is to overestimate slightly and then scale down based on previous runs. Significant overestimation, however, leads to inefficient use of system resources and unnecessary spending of your CPU-time (SU) allocation.
Slurm provides the `seff` tool, which reports how much CPU and memory a completed job actually consumed; use its output to readjust the resource requests of subsequent runs.
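For example (the job ID below is a placeholder; use the ID printed by `sbatch`):

```bash
# Report CPU and memory efficiency for a completed job
seff 123456

# Alternatively, query the accounting database for peak memory and CPU time
sacct -j 123456 --format=JobID,Elapsed,TotalCPU,MaxRSS
```

If `seff` reports low CPU or memory efficiency, reduce `--cpus-per-task` or `--mem` in your next submission.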
Understanding where your code is spending time or memory is key to efficient resource usage. Profiling helps you answer questions like:
- Am I using too many CPU cores without benefit?
- Is my job memory-bound or I/O-bound?
- Are there inefficient loops, repeated operations, or unused computations?
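A quick, low-effort way to answer the memory question is GNU time, which prints the peak resident set size of a run. The program and input file below are placeholders:

```bash
# -v prints detailed statistics, including "Maximum resident set size",
# a good starting point for the --mem request of future jobs.
/usr/bin/time -v ./my_program input.dat
```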
## Tips for optimization
- Use multi-threaded or parallel libraries (OpenMP, MPI, NumPy with MKL).
- Avoid unnecessary data copying or large in-memory objects.
- Stream large files instead of loading entire datasets into memory.
- Use job arrays for independent jobs instead of looping in one script (see the sketch after this list).
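The job-array tip can be sketched as follows; the input-file naming scheme and program are assumptions for illustration:

```bash
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --partition=general
#SBATCH --qos=standard
#SBATCH --array=1-10               # 10 independent tasks, scheduled separately
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00

# Each array task receives its own index in SLURM_ARRAY_TASK_ID.
./my_program input_${SLURM_ARRAY_TASK_ID}.dat   # placeholder program and inputs
```

Because each array task is an independent job, the scheduler can run them concurrently as resources become free, instead of serializing them inside one long-running script.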
## Be careful about invalid configuration
Misconfigured job submissions can lead to job failures, wasted compute time, or inefficient resource usage.
Below are some common mistakes and conflicts to watch out for when submitting jobs to SLURM:
- Asking for more CPUs, memory, or GPUs than any node in the cluster can offer. The job stays in the pending state indefinitely with a reason such as `ReqNodeNotAvail` or `Resources`.
- Mismatch between CPUs and tasks. For example, using `--ntasks=4` and `--cpus-per-task=8` when your script is single-threaded: you block 32 cores but effectively use only 1, which leads to very low CPU efficiency (see the sketch after this list).
- Specifying a walltime of more than 3 days, which exceeds the 72-hour QoS limit and is not allowed.
- Submitting to a partition that doesn't match your job type, e.g. requesting a GPU on a non-GPU partition: `--partition=general --gres=gpu:1`. The job will fail immediately or be held forever.
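As a sketch of how to fix the CPU/task mismatch above: for a single multi-threaded (e.g. OpenMP) program, request one task and give it the cores, rather than requesting multiple tasks. The executable name is a placeholder:

```bash
#!/bin/bash
#SBATCH --partition=general
#SBATCH --qos=standard
#SBATCH --ntasks=1                 # one process...
#SBATCH --cpus-per-task=8          # ...with 8 threads, instead of --ntasks=4 --cpus-per-task=8
#SBATCH --time=04:00:00

# Tell the OpenMP runtime to use exactly the cores Slurm allocated.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_threaded_program              # placeholder executable
```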