Overview

Wulver is a resource shared among researchers, faculty, and students. It is important to use it efficiently so that everyone can complete their work without delays. When running jobs on Wulver, follow the practices below, which help ensure that your work finishes on time and that others can run theirs without conflict. We use Slurm on Wulver to schedule and manage jobs.

What is Slurm?

Slurm (Simple Linux Utility for Resource Management) is an open-source workload manager and job scheduler designed for high-performance computing clusters. It is widely used in research, academia, and industry to efficiently manage and allocate computing resources such as CPUs, GPUs, memory, and storage for running various types of jobs and tasks. Slurm helps optimize resource utilization, minimizes job conflicts, and provides a flexible framework for distributing workloads across a cluster of machines. It offers features like job prioritization, fair sharing of resources, job dependencies, and real-time monitoring, making it an essential tool for orchestrating complex computational workflows in diverse fields.

Some best practices to follow:

Request Only the Resources You Need:
Be precise when requesting CPUs, memory, GPUs, and runtime.
Avoid overestimating job time (--time) and memory (--mem), as this reduces scheduler efficiency.
Use monitoring tools to understand your typical usage patterns and adjust accordingly.
Do Not Run Jobs on the Login Node:
The login node is the entry point to Wulver and has limited memory and resources.
Avoid running jobs directly on the login node, as this can slow down the system for everyone, cause timeouts or forced termination of your session, and disrupt file I/O or job submissions for other users, among other problems.
Always submit jobs to compute nodes via a Slurm script or start an interactive session.
Use Appropriate Partitions:
Submit jobs to the correct partition based on resource needs (e.g., GPU, high-memory).
Avoid using the default or high-priority partitions unless necessary.
Test and Debug with Small Jobs First:
Use short test runs or dedicated debug partitions for code testing or troubleshooting.
This helps prevent long-running failures and wasted compute hours.
Monitor Your Jobs:
Use commands such as squeue, slurm_jobid $jobid, and seff $jobid to check your job statistics.
You can also use our OnDemand tools.
Respect Fair Usage Policies:
Do not monopolize shared resources by submitting excessively large jobs or hogging GPUs.
Be mindful of Wulver's usage policy.
Leverage MIGs for Efficient GPU Utilization:
Our NVIDIA A100 GPUs support MIG (Multi-Instance GPU), which allows a single GPU to be split into multiple isolated instances.
Use MIG-compatible partitions when your task doesn’t require the full power of a GPU.
More info about MIG.
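
As an illustration of the first few practices above, the following is a minimal sketch of a helper that prints a Slurm batch script with explicit, modest resource requests. The function name make_job_script, the partition name debug, and the program ./my_program are placeholders chosen for this example, not Wulver-specific values; substitute the partition, account, and QoS settings that apply to your allocation.

```shell
#!/bin/bash
# Sketch: print a Slurm batch script with explicit resource requests.
# make_job_script is a hypothetical helper; the partition name and
# ./my_program used below are placeholders, not Wulver defaults.

# Arguments: job name, partition, CPUs per task, memory, time limit (HH:MM:SS)
make_job_script() {
    local name="$1" partition="$2" cpus="$3" mem="$4" time_limit="$5"
    # The heredoc body is the batch script itself; #SBATCH lines are
    # read by sbatch as options when the script is submitted.
    cat <<EOF
#!/bin/bash -l
#SBATCH --job-name=${name}
#SBATCH --partition=${partition}
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=${cpus}
#SBATCH --mem=${mem}
#SBATCH --time=${time_limit}
#SBATCH --output=%x.%j.out

srun ./my_program
EOF
}

# Short test run: 2 CPUs, 4 GB of memory, 10 minutes.
make_job_script test_run debug 2 4G 00:10:00
```

To use it, redirect the output to a file (for example test_run.sh) and submit that file with sbatch from the login node, so the work itself runs on a compute node. Once the job finishes, seff $jobid shows how much of the requested CPU time and memory was actually used, which helps tighten --time and --mem for the next run.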