We wanted to see how well VASP 5.4.4 scales on the CPU nodes of the Salomon and Barbora supercomputers at the IT4Innovations National Supercomputing Center in Ostrava, Czech Republic.
Both clusters are x86, Intel-based CPU machines. Barbora also has 8 nodes with 4x Tesla V100 each, but those are not a subject of this test. Salomon has 24 cores per node in two sockets; Barbora has 36 cores per node in the same configuration.
For the test case, a 3x2x2 supercell of PtN2 fluorite was used, as it gives a calculation large enough to run on as many as 64 nodes. The number of irreducible k-points is 45 and the number of bands is 768. A total of 10 iterations was run for each test, and the average time of one iteration (the "real time" value from `grep LOOP OUTCAR`) is used to compare the results.
A number of different configurations of the `KPAR` and `NCORE` parameters were used to see which one behaves best for this type of calculation. It is, however, impractical, maybe even impossible, to run multiple tests for every calculation just to determine which configuration yields the results in the shortest time. To mitigate this issue, a "default strategy" is defined. This rule of thumb is based on previous experience, and it produced very good results in these tests as well.
The "default strategy" is following. KPAR=<number-of-CPU-sockets>
, which in our case is 2 per each node. NCORE
is then equal to the number of cores in each socket. In our case that's 12 on Salomon and 18 on Barbora. Equally, NPAR
could be set to 1, but that would fall apart when the KPAR
would change, so I find it more convenient to set the NCORE
parameter instead. NSIM
is equal to 4 as the NSIM
parameter has been observed to have produced minimal speedup on non-accelerated nodes. The number of MPI
processes per node is equal to the number of cores on that node and OMP_NUM_THEADS
is 1.
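To make the rule of thumb explicit, here is a small sketch of how the default-strategy parameters follow from the node count (the function name and dictionary layout are just illustrative):

```python
def default_strategy(nodes, sockets_per_node=2, cores_per_socket=12):
    """Default strategy: KPAR = total number of CPU sockets in the job,
    NCORE = cores per socket (12 on Salomon, 18 on Barbora)."""
    return {
        "KPAR": nodes * sockets_per_node,
        "NCORE": cores_per_socket,
        "NSIM": 4,
        "mpi_ranks_per_node": sockets_per_node * cores_per_socket,
        "OMP_NUM_THREADS": 1,
    }

# e.g. 16 Salomon nodes -> KPAR=32, NCORE=12, 24 MPI ranks per node
print(default_strategy(16))
```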
Let's take a look at the results. The full results in tabular form are attached at the end of this post.
Note how the default strategy and the best cases start to differ at 32 nodes and above. This will be more visible in the following figures.
Here we can see that both on Salomon and on Barbora, the scaling of the default strategy practically stops above 32 nodes. The reason for this is discussed below these figures.
The scaling performance for a given number of nodes is calculated relative to the performance at the smallest number of nodes (in this case 2) in the given data set.
This last figure is another visualization of the scaling. Here, the scaling performance from the second figure is divided by the number of nodes on which the job ran. This gives the efficiency of the scaling relative to the lowest number of nodes (in our case 2), which is why the efficiency remains non-zero even above 32 nodes. A different way to visualize this would be to compare the increase in performance to the increase in the number of nodes; in that case, we would see an efficiency around 0 for the default strategies above 32 nodes.
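For reference, this is how the speedup and efficiency can be computed from the tabulated `LOOP REAL` times (the exact normalization used in the figures may differ slightly, but the idea is the same):

```python
def scaling(nodes, loop_real):
    """Speedup and efficiency relative to the smallest node count in the data set."""
    base_nodes, base_time = nodes[0], loop_real[0]
    speedup = [base_time / t for t in loop_real]                       # second figure
    efficiency = [s * base_nodes / n for s, n in zip(speedup, nodes)]  # third figure
    return speedup, efficiency

# Salomon, default strategy (LOOP REAL values from the tables below)
nodes = [2, 4, 8, 16, 32, 48, 64]
loop  = [896.27, 458.865, 241.563, 158.958, 82.6463, 87.1981, 80.8726]
speedup, efficiency = scaling(nodes, loop)
print(efficiency)   # drops to roughly 0.35 at 64 nodes
```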
The reason why this drop in efficiency happens becomes obvious when looking at the tabular data at the end of this post. The default strategy tries to push `KPAR` above 45, even though there are only 45 k-points in the calculation. VASP does not complain about this, but the inefficiency of doing so pretty much nullifies the effect of adding more nodes to the calculation.
However, if one makes a short VASP run to see how many irreducible k-points there will be, the issue can easily be compensated for: the default strategy changes on node counts where `KPAR` would need to be larger than the number of k-points. There, it becomes the following: increase `KPAR` as much as possible without exceeding the number of k-points and set `NCORE` to the number of cores on one node. With this addition, the default strategy gets very close to the best cases (as seen in the tabular data, e.g. for Salomon; Barbora behaves a bit more unpredictably).
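A sketch of this adjusted rule is below. The assumption that `KPAR` should also divide the total MPI rank count is mine (the capped runs above simply used `KPAR=32`), so treat the capping heuristic as illustrative:

```python
def adjusted_default(nodes, nkpts, sockets_per_node=2, cores_per_socket=12):
    """Default strategy with the k-point cap applied."""
    cores_per_node = sockets_per_node * cores_per_socket
    total_ranks = nodes * cores_per_node
    kpar = nodes * sockets_per_node                 # plain default strategy
    if kpar <= nkpts:
        return {"KPAR": kpar, "NCORE": cores_per_socket}
    # KPAR would exceed the number of irreducible k-points: pick the largest
    # value that still fits (assumed here to also divide the rank count)
    # and widen NCORE to a full node.
    kpar = max(k for k in range(1, nkpts + 1) if total_ranks % k == 0)
    return {"KPAR": kpar, "NCORE": cores_per_node}

# 48 Salomon nodes (24 cores each), 45 irreducible k-points
print(adjusted_default(48, 45))
```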
Another thing to bear in mind is to spread the load of k-points as evenly as possible. This is not an issue when `<number-of-k-points>/KPAR` is high (say, above 10). But when it is close to 1 or 2, one must be careful. If `KPAR` is, for example, 30 and the number of irreducible k-points is 45, then the first group of 30 k-points is computed in parallel, but the remaining 15 k-points can occupy only half of the cores, leaving the other half waiting. Ideally, `<number-of-k-points>/KPAR` is a whole number, but forcing that can break the default strategy and introduce unwanted communication overhead.
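A quick way to sanity-check this balance is to look at how busy the last "round" of k-points is, as in this small sketch:

```python
import math

def kpoint_balance(nkpts, kpar):
    """Fraction of the KPAR groups that still have work in the last round of k-points."""
    rounds = math.ceil(nkpts / kpar)
    busy_in_last_round = nkpts - (rounds - 1) * kpar
    return busy_in_last_round / kpar

print(kpoint_balance(45, 30))   # 0.5 -> half of the cores idle during the second round
print(kpoint_balance(45, 45))   # 1.0 -> perfectly balanced
```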
Last but not least, it is evident from the default strategy that as long as the ratio between the number of nodes and `KPAR` is constant, the memory requirements per node stay more or less constant as well. So if one can fit a calculation into memory on 1 node with `KPAR=1`, they should theoretically also be able to fit it into memory on 32 nodes with `KPAR=32`. This of course assumes that there are enough k-points (and it can break the load balancing).
It is also evident that halving the `KPAR`-to-node ratio effectively halves the per-node memory usage. So setting `KPAR` lower than the default strategy prescribes is necessary to fit larger jobs into memory.
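As a rough rule of thumb derived from the tabulated `PS MEM` values (not a real memory model), the per-node footprint scales with `KPAR` divided by the number of nodes:

```python
def relative_memory_per_node(nodes, kpar, ref_nodes=1, ref_kpar=1):
    """Rough per-node memory relative to a reference run; constant when the
    nodes-to-KPAR ratio is kept constant, smaller when KPAR is lowered."""
    return (kpar / nodes) / (ref_kpar / ref_nodes)

print(relative_memory_per_node(32, 32))   # ~1.0 -> same footprint as 1 node with KPAR=1
print(relative_memory_per_node(32, 16))   # ~0.5 -> roughly half the per-node memory
```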
Tabular results
Salomon CPU (best cases)
NODES | LOOP REAL | PS CPU | PS MEM | ITERS | KPAR | NCORE | NSIM | KPOINTS | ATOMS | NBANDS |
---|---|---|---|---|---|---|---|---|---|---|
2 | 896.27 | 98.0042 | 58.1 | 10 | 4 | 12 | 4 | 45 | 192 | 768 |
4 | 458.865 | 96.3542 | 59.9 | 10 | 8 | 12 | 4 | 45 | 192 | 768 |
8 | 241.563 | 93.0958 | 59.4 | 10 | 16 | 12 | 4 | 45 | 192 | 768 |
16 | 158.958 | 86.5625 | 59 | 10 | 32 | 12 | 4 | 45 | 192 | 768 |
32 | 71.7904 | 89.4583 | 31.2 | 10 | 32 | 24 | 4 | 45 | 192 | 768 |
48 | 50.0431 | 89.2792 | 21.5 | 10 | 32 | 24 | 4 | 45 | 192 | 768 |
64 | 39.202 | 82.6958 | 16.8 | 10 | 32 | 24 | 4 | 45 | 192 | 768 |
Salomon CPU (default strategy)
NODES | LOOP REAL | PS CPU | PS MEM | ITERS | KPAR | NCORE | NSIM | KPOINTS | ATOMS | NBANDS |
---|---|---|---|---|---|---|---|---|---|---|
2 | 896.27 | 98.0042 | 58.1 | 10 | 4 | 12 | 4 | 45 | 192 | 768 |
4 | 458.865 | 96.3542 | 59.9 | 10 | 8 | 12 | 4 | 45 | 192 | 768 |
8 | 241.563 | 93.0958 | 59.4 | 10 | 16 | 12 | 4 | 45 | 192 | 768 |
16 | 158.958 | 86.5625 | 59 | 10 | 32 | 12 | 4 | 45 | 192 | 768 |
32 | 82.6463 | 82.6792 | 59.2 | 10 | 64 | 12 | 4 | 45 | 192 | 768 |
48 | 87.1981 | 86.8958 | 59.4 | 10 | 96 | 12 | 4 | 45 | 192 | 768 |
64 | 80.8726 | 82.5125 | 59.3 | 10 | 128 | 12 | 4 | 45 | 192 | 768 |
Barbora CPU (best cases)
NODES | LOOP REAL | PS CPU | PS MEM | ITERS | KPAR | NCORE | NSIM | KPOINTS | ATOMS | NBANDS |
---|---|---|---|---|---|---|---|---|---|---|
2 | 349.713 | 96.1111 | 19.7 | 10 | 2 | 36 | 4 | 45 | 192 | 768 |
4 | 184.171 | 95.5028 | 39.6 | 10 | 8 | 18 | 4 | 45 | 192 | 768 |
8 | 92.3066 | 86.8194 | 39.6 | 10 | 16 | 18 | 4 | 45 | 192 | 768 |
16 | 49.2612 | 92.7528 | 21.6 | 10 | 16 | 18 | 4 | 45 | 192 | 768 |
32 | 33.7342 | 96.0639 | 39.6 | 10 | 64 | 18 | 4 | 45 | 192 | 768 |
48 | 19.069 | 94.3889 | 21.6 | 10 | 48 | 18 | 4 | 45 | 192 | 768 |
64 | 18.9109 | 93.675 | 21.6 | 10 | 64 | 18 | 4 | 45 | 192 | 768 |
Barbora CPU (default strategy)
NODES | LOOP REAL | PS CPU | PS MEM | ITERS | KPAR | NCORE | NSIM | KPOINTS | ATOMS | NBANDS |
---|---|---|---|---|---|---|---|---|---|---|
2 | 359.778 | 97.7361 | 39.6 | 10 | 4 | 18 | 4 | 45 | 192 | 768 |
4 | 184.171 | 95.5028 | 39.6 | 10 | 8 | 18 | 4 | 45 | 192 | 768 |
8 | 92.3066 | 86.8194 | 39.6 | 10 | 16 | 18 | 4 | 45 | 192 | 768 |
16 | 63.5415 | 91.0806 | 39.6 | 10 | 32 | 18 | 4 | 45 | 192 | 768 |
32 | 33.7342 | 96.0639 | 39.6 | 10 | 64 | 18 | 4 | 45 | 192 | 768 |
48 | 33.4336 | 91.5639 | 39.6 | 10 | 96 | 18 | 4 | 45 | 192 | 768 |
64 | 34.0071 | 93.775 | 39.6 | 10 | 128 | 18 | 4 | 45 | 192 | 768 |
`LOOP REAL` is the average time per iteration in seconds. `PS CPU` is the average CPU utilization in % from 10 samples taken at random times. `PS MEM` is the highest memory usage in % from 10 random samples. `ITERS` is the number of performed iterations.
Note the constant memory usage in both default strategies. This is because the ratio between the number of nodes and `KPAR` is constant.