Memory usage and good citizenship

On a shared machine such as the JHPCE cluster, it is important that everyone use resources wisely to avoid congestion and to make the most of the cluster for all users. Please be a good citizen and use memory wisely. Here are three recommendations.

Recommendations

1) Set your mem_free request so it is as close as possible to your expected actual usage (for multi-slot jobs, your actual usage divided by the number of slots should be slightly lower than the mem_free you request per slot). Avoid “playing it safe” and asking for substantially more memory than you actually need.
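
For illustration only (the script name, slot count, and memory figures below are made up, and the parallel environment name may differ from the one your queue uses), a job that peaks at about 20 GB in total across 4 slots uses roughly 5 GB per slot, so a per-slot request just above that is about right:

    # ~20 GB peak across 4 slots is ~5 GB per slot; request a little headroom
    qsub -pe local 4 -l mem_free=6G my_analysis.sh

Requesting mem_free=20G per slot for the same job would reserve four times the memory it can ever use.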

2) Use the commands described here to observe memory usage in running jobs and to check memory usage in completed or aborted jobs. We understand that sometimes it is not possible to know how much memory a job will use, and in those cases some wasted memory is certainly reasonable, but submitting a large number of such jobs is not good citizenship.
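
As a sketch of the kind of commands this refers to (the job ID below is a placeholder, and the exact fields shown can vary with the grid engine version), on an SGE-based system qstat reports the usage of a running job and qacct reports the accounting record of a finished one:

    # running job: the "usage" line includes vmem and the peak maxvmem
    qstat -j 1234567

    # completed or aborted job: maxvmem is the peak memory the job used
    qacct -j 1234567

Comparing maxvmem to the mem_free you requested shows how well-sized your request was.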

We are asking users to be more careful with their memory requests because an analysis of jobs run from January through March 2017 revealed that, on average, the memory allocation rate is 81% of capacity (a good number), but ACTUAL memory usage hovers around 23% (terrible!). This means that too many users are requesting far more memory than they need. This causes the grid engine to think that the cluster is busier than it actually is, which in turn causes it to unnecessarily restrict the number of jobs that it starts on the cluster.

We also determined that some users are using much more memory than they request. This can lead to node crashes. To prevent this, we recommend that you:

3) Set your high-water mark (h_vmem) so it matches your mem_free. Using more memory than is requested is bad citizenship because it can lead to a situation where the scheduler sends too many jobs to a node, which, in the worst case, results in a crash. This used to be a common occurrence before we made memory a consumable resource.
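
Continuing the hypothetical example from recommendation 1, matching the high-water mark to the request just means adding h_vmem with the same per-slot value:

    # h_vmem matches mem_free; a job that tries to exceed its request fails
    # rather than overrunning the node
    qsub -pe local 4 -l mem_free=6G,h_vmem=6G my_analysis.sh

A job that runs past its h_vmem limit is stopped by the grid engine, which is far better for everyone than taking the whole node down.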

Please be good citizens. Big Brother is watching …

Results of analysis

The table below shows the amount of memory (in TB-days) wasted between Jan-Mar 2017 by users who requested more memory than they actually used. The results are ordered by wasted usage. We anonymized the users.

+-----------+----------------+---------+
| anonymized| wasted usage   | N(jobs) |
|  userid   |  (TB-days)     |         |
+-----------+----------------+---------+
| 329055714 | 54             | 234     |
| 329082844 | 28             | 880     |
| 329113510 | 21             | 23      |
| 329062952 | 19             | 1949    |
| 329055899 | 19             | 12496   |
| 329055645 | 17             | 96184   |
| 329056431 | 14             | 201     |
| 330857730 | 13             | 114     |
| 329055782 | 7              | 159919  |
| 329141862 | 7              | 255     |
| 329056101 | 7              | 2659    |
| 329055698 | 6              | 50425   |
| 329075706 | 6              | 225     |
| 329055637 | 6              | 521889  |
| 329056935 | 5              | 6298    |
| 329055992 | 4              | 22619   |
| 329063154 | 4              | 1403    |
| 330887458 | 4              | 11      |
| 329070410 | 4              | 405     |
| 329056652 | 4              | 2245    |
+-----------+----------------+---------+

Below are users who set their high-water mark so high that their jobs consumed much more memory than they requested. The scheduler does not know when a job is using more memory than it requested, and so it can unknowingly overbook the memory on a host. This can negatively impact performance and (in the worst case) cause a host to crash. In effect these users are stealing memory by using more than they requested. Between Jan-Mar 2017 there were only 6 users who systematically did this. The amount of stolen memory is shown in the table below.

+-----------+----------------+---------+
| uid       | stolen TB-days | N(jobs) |
+-----------+----------------+---------+
| 329060039 | 8.192          | 308     |
| 329055678 | 0.434          | 257045  |
| 329056939 | 0.396          | 709     |
| 329351072 | 0.124          | 11      |
| 329082616 | 0.074          | 15      |
| 329058516 | 0.008          | 64      |
+-----------+----------------+---------+

 
