# Nodes and Memory
## Overview
Before submitting your job to the scheduler, it is important to know how many cores and how much memory your task requires, since these resources are assigned based on the partition and the number of nodes you request.
## Partition (Use `--partition`)
Wulver has four partitions, which differ in core count, GPUs, and memory available:
| Partition | Nodes | Cores/Node | CPU | GPU | Memory |
|---|---|---|---|---|---|
| `--partition=general` | 100 | 128 | 2.5 GHz AMD EPYC 7763 (2) | NA | 512 GB |
| `--partition=debug` | 1 | 4 | 2.5 GHz AMD EPYC 7763 (2) | NA | 512 GB |
| `--partition=gpu` | 25 | 128 | 2.0 GHz AMD EPYC 7713 (2) | NVIDIA A100 GPUs (4) | 512 GB |
| `--partition=bigmem` | 2 | 128 | 2.5 GHz AMD EPYC 7763 (2) | NA | 2 TB |
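As a minimal sketch, the partition is selected with an `#SBATCH --partition` directive in your job script. The job name and executable below are placeholders, and a real script will also need a QoS (and possibly an account), covered in the next section:

```bash
#!/bin/bash
#SBATCH --job-name=gpu_test        # placeholder job name
#SBATCH --partition=gpu            # one of: general, debug, gpu, bigmem
#SBATCH --gres=gpu:1               # request 1 of the 4 A100 GPUs on a gpu node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G                  # well under the 512 GB available per node
#SBATCH --time=02:00:00            # HH:MM:SS, must stay within the QoS wall time limit

./my_gpu_program                   # placeholder executable
```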
## Priority (Use `--qos`)
Wulver implements job “priority” in Slurm as Quality of Service (QoS); the available QoS levels are:
| QoS | Purpose | Rules | Wall time limit (hours) | Valid Users |
|---|---|---|---|---|
| `--qos=standard` | Normal jobs. Faculty PIs are allocated 300,000 Service Units (SU) per year | SU charges based on node type (see SU); jobs can be preempted by enqueued high-QoS jobs | 72 | Everyone |
| `--qos=low` | Free access, no SU charge | Jobs can be preempted by enqueued high- or standard-QoS jobs | 72 | Everyone |
| `--qos=high_$PI` | Replace `$PI` with the UCID of the PI; only available to owners/investors | Highest-priority jobs, no SU charge | 72 | Owner/investor PI groups |
| `--qos=debug` | Intended for debugging and testing jobs | No SU charge; a maximum of 4 CPUs is allowed; must be used with `--partition=debug` | 8 | Everyone |
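For instance, a short test run under the `debug` QoS must also use the `debug` partition and stay within 4 CPUs. A minimal sketch (the job name and executable are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=debug_test      # placeholder job name
#SBATCH --partition=debug          # debug QoS must be paired with the debug partition
#SBATCH --qos=debug                # no SU charge, 8-hour wall time limit
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4          # debug QoS allows at most 4 CPUs
#SBATCH --time=00:30:00

./my_test_program                  # placeholder executable
```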
## How many cores and memory do I need?
There is no deterministic method of finding the exact amount of memory a job needs in advance. A general rule of thumb is to overestimate slightly and then scale down based on previous runs. Significant overestimation, however, leads to inefficient use of system resources and unnecessary spending of your CPU-time (SU) allocation.
Slurm provides the `seff` tool, which reports how much CPU and memory a completed job actually consumed; use its output to readjust the resource requests of subsequent runs.
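For example (the job ID below is a placeholder; use the ID printed by `sbatch`):

```bash
# Report CPU and memory efficiency for a completed job
seff 123456

# Alternatively, query the accounting database for peak memory and CPU time
sacct -j 123456 --format=JobID,Elapsed,TotalCPU,MaxRSS
```

If `seff` reports low CPU or memory efficiency, reduce `--cpus-per-task` or `--mem` in your next submission.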
Understanding where your code is spending time or memory is key to efficient resource usage. Profiling helps you answer questions like:
- Am I using too many CPU cores without benefit?
- Is my job memory-bound or I/O-bound?
- Are there inefficient loops, repeated operations, or unused computations?
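A quick, low-effort way to answer the memory question is GNU time, which prints the peak resident set size of a run. The program and input file below are placeholders:

```bash
# -v prints detailed statistics, including "Maximum resident set size",
# a good starting point for the --mem request of future jobs.
/usr/bin/time -v ./my_program input.dat
```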
## Tips for optimization
- Use multi-threaded or parallel libraries (OpenMP, MPI, NumPy with MKL).
- Avoid unnecessary data copying or large in-memory objects.
- Stream large files instead of loading entire datasets into memory.
- Use job arrays for independent jobs instead of looping in one script (see the sketch after this list).
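The job-array tip can be sketched as follows; the input-file naming scheme and program are assumptions for illustration:

```bash
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --partition=general
#SBATCH --qos=standard
#SBATCH --array=1-10               # 10 independent tasks, scheduled separately
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00

# Each array task receives its own index in SLURM_ARRAY_TASK_ID.
./my_program input_${SLURM_ARRAY_TASK_ID}.dat   # placeholder program and inputs
```

Because each array task is an independent job, the scheduler can run them concurrently as resources become free, instead of serializing them inside one long-running script.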
## Be careful about invalid configuration
Misconfigured job submissions can lead to job failures, wasted compute time, or inefficient resource usage.
Below are some common mistakes and conflicts to watch out for when submitting jobs to SLURM:
- Asking for more CPUs, memory, or GPUs than any node in the cluster can offer. The job stays in the pending state indefinitely with a reason such as `ReqNodeNotAvail` or `Resources`.
- Mismatch between CPUs and tasks. For example, using `--ntasks=4` and `--cpus-per-task=8` when your script is single-threaded: you block 32 cores but effectively use only 1, which leads to very low CPU efficiency (see the sketch after this list).
- Specifying a walltime of more than 3 days, which exceeds the 72-hour QoS limit and is not allowed.
- Submitting to a partition that doesn't match your job type, e.g. requesting a GPU on a non-GPU partition: `--partition=general --gres=gpu:1`. The job will fail immediately or be held forever.
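As a sketch of how to fix the CPU/task mismatch above: for a single multi-threaded (e.g. OpenMP) program, request one task and give it the cores, rather than requesting multiple tasks. The executable name is a placeholder:

```bash
#!/bin/bash
#SBATCH --partition=general
#SBATCH --qos=standard
#SBATCH --ntasks=1                 # one process...
#SBATCH --cpus-per-task=8          # ...with 8 threads, instead of --ntasks=4 --cpus-per-task=8
#SBATCH --time=04:00:00

# Tell the OpenMP runtime to use exactly the cores Slurm allocated.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_threaded_program              # placeholder executable
```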