COPYRIGHT (c) 1993-1996 Stratus Computer, Inc. All Rights Reserved. This document is the property of Stratus Computer, Inc. Permission is granted to use this document. Permission is granted to redistribute this document as long as this notice is intact, no fee is charged, and all original files are distributed intact. THIS INFORMATION IS PROVIDED ON AN "AS IS" BASIS WITHOUT WARRANTY OR SUPPORT OF ANY KIND. STRATUS SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT SHALL STRATUS COMPUTER OR ITS SUPPLIERS BE LIABLE FOR ANY DAMAGES WHATSOEVER INCLUDING DIRECT, INDIRECT, INCIDENTAL, CONSEQUENTIAL, LOSS OF BUSINESS PROFITS OR SPECIAL DAMAGES, EVEN IF STRATUS COMPUTER OR ITS SUPPLIERS HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. SOME STATES DO NOT ALLOW THE EXCLUSION OR LIMITATION OF LIABILITY FOR CONSEQUENTIAL OR INCIDENTAL DAMAGES SO THE FOREGOING MAY NOT APPLY. THIS DISCLAIMER APPLIES DESPITE ANY VERBAL REPRESENTATIONS OF ANY KIND PROVIDED BY ANY STRATUS EMPLOYEE OR REPRESENTATIVE.

Advice on Porting to Atlantic Systems

Paul Green
July 1, 1993
Revised September 20, 1996

1. INTRODUCTION

Customers porting existing software to Stratus RISC Multiprocessor systems (Model XA/R 310, 320, and 330, code named Atlantic) need to be aware of methods to optimize their software to achieve maximum performance on these new systems. Customers writing new software for these systems may be able to achieve better performance from the beginning by being aware of the same methods.

As always, the overall application design, its algorithms, data types, and file access methods are the most important factors to consider in order to meet the performance goals of a product. The steps outlined hereafter should be used in conjunction with the normal procedures for designing, implementing and tuning high-performance software.
You are encouraged to read Stratus Technical Report TR-3, VOS System Performance Analysis, for advice on this phase.

The usual disclaimers apply; not all of the recommendations given here apply to all situations, and the exact performance of an application is dependent on many factors, only some of which are covered in this memo.

This memo is written from the point of view of a VOS user, but many of the recommendations apply to FTX users as well. If a recommendation is specific to VOS, it is so noted.

2. RECOMMENDATIONS

These recommendations are in order of increasing difficulty of implementation.

2.1. Get the number of alignment faults down to a few per second.

Why? The cost of processing an alignment fault is simply too high. The tools necessary to find alignment faults are documented in the file >system>doc>r11.5_addendum_srb.memo. There are two additional files which contain general information regarding alignment faults. The files may be located as follows:

     >system>doc>os_r9_srb.memo
     >system>doc>os_r11_srb.memo

The default methods of compiling will generate code that lays data structures out compatibly with the CISC processors, but generates machine instructions that require RISC alignment. This method generates alignment faults when the data is not on a RISC boundary. We have seen applications that take thousands of alignment faults per second.

There are two techniques for eliminating alignment faults. The best way is to re-layout structures using the longmap attribute and the longmap/check compiler option. An easier method is to leave the data structures alone and use the shortmap compiler option. Note that compiling for shortmap can greatly increase the size of the generated machine code, so this technique should be used only as an interim step towards full longmapped data structures.

(VOS only. FTX does not support alignment faults; any alignment error is fatal.)

2.2. Do not attempt to run the processors at 100% utilization.

Why?
The scaling degrades as the processors in a multiprocessing system compete with each other for access to the main system bus. Keep the processor utilization below 80% or so. (See the "CPU minutes" percent value reported by display_system_usage.)

2.3. Compile all code for the highest optimization level. Do not ship production code with full symbol tables.

Why? Compilers have been extensively improved over the last few years and are capable of performing many state-of-the-art optimizations on your programs. These optimizations are crucial to getting the best performance out of RISC-based systems. If you do not use the maximum optimization level you are not getting the benefit of the compiler's optimizations. If you use full symbol tables you are getting no optimizations, as this option overrides the optimization level option.

The highest optimization level on VOS is 4. The -table option produces full symbol tables; use -production_table to get tables and (some) optimizations; omit both -table and -production_table and use -optimization_level 4 to get the best optimization. See the languages doc files, SRB, or manuals for more detail. Similar comments apply to FTX.

2.4. Minimize sharing of data between processors. Watch out for heavily-used global (shared vm) variables that are updated by every transaction. Statistics/accounting/timing variables count just as much as program logic variables or data. Eliminate unnecessary sharing. Make variables per-processor if possible. Be aware of cache line boundaries. The data you make per-processor *must* be in different cache lines or you will get false sharing.

Why? The cost of a cache miss is about an order of magnitude higher than that of a cache hit. The cost of a cache miss that is actually a reference to a location contained in the cache of another processor (accessing actively shared data, called a "snoop") is about two orders of magnitude higher than that of a cache hit.
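
The per-processor advice above can be illustrated with a minimal sketch in modern C (C11), which is not the compiler dialect this memo was written for. It assumes the 32-byte cache line size stated elsewhere in this memo. The names per_cpu_stats, MAX_CPUS, count_transaction and total_transactions are hypothetical, and the caller is assumed to supply its own slot index (for example, one slot per server process).

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 32-byte cache lines, the size this memo gives. */
#define CACHE_LINE_BYTES 32
#define MAX_CPUS 6   /* hypothetical upper bound, for illustration only */

/* One statistics slot per processor, padded to fill a whole cache line
   so that two slots never share a line (no false sharing). */
struct per_cpu_stats {
    unsigned long transactions;
    unsigned long errors;
    char pad[CACHE_LINE_BYTES - 2 * sizeof(unsigned long)];
};

static_assert(sizeof(struct per_cpu_stats) == CACHE_LINE_BYTES,
              "each slot must occupy exactly one cache line");

/* The array itself must start on a cache line boundary. */
static _Alignas(CACHE_LINE_BYTES) struct per_cpu_stats stats[MAX_CPUS];

/* Each updater touches only its own slot. */
void count_transaction(int slot)
{
    stats[slot].transactions++;
}

/* Readers sum the slots only when a total is actually needed. */
unsigned long total_transactions(void)
{
    unsigned long total = 0;
    for (int i = 0; i < MAX_CPUS; i++)
        total += stats[i].transactions;
    return total;
}
```

The design trades a little memory and a slower read of the total for updates that stay within one cache line per updater.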
Even a small cache miss rate (3-4%) can cause the majority of the processor time to be spent processing cache misses. Even a very small snoop rate (.3%) can dominate the cache miss timings. It is possible to saturate the main bus on a Model 330 just processing cache misses and snoops. The symptom of a high cache miss rate is an increase in CPU time per transaction as the number of processors is increased (i.e., nonlinear MP scaling).

2.5. Use the shared vm cache access mode appropriate for the reference patterns of the data. Avoid excessive use of "exclusive sequential" access mode (mode 2; VOS only).

Why? Data that is cached can be referenced much faster than data that is not cached. References to the cache avoid expensive main bus cycles. Some new access modes permit data to be in multiple caches; this avoids expensive snooping.

Exclusive sequential access mode should be avoided because such data cannot be kept in the L1 (on-chip) cache and can appear in at most one L2 (per CPU board) cache at a time. Therefore, it is the most expensive data to access.

The following table shows the three cache policies supported by VOS, the access mode to use with s$connect_vm_region, how the cache policies use the L1 and L2 caches, and how many caches the data can appear in simultaneously. FTX supports only the last two modes, using memctl and the MC_ADVISE flag.

+---------------+---+-----------------+-------------+---------+
| Cache Policy  | # | L1 Behavior     | L2 Behavior | #Caches |
|---------------|---|-----------------|-------------|---------|
| Exclusive     | 2 | Uncached        | Write Back  | One     |
| Sequential    |   |                 |             |         |
|---------------|---|-----------------|-------------|---------|
| Exclusive     | 7 | Write Once      | Write Back  | One     |
| Nonsequential |   |                 |             |         |
|---------------|---|-----------------|-------------|---------|
| Shared        | 9 | Write Thru      | Write Thru  | All     |
| Nonsequential |   |                 |             |         |
+---------------+---+-----------------+-------------+---------+

#: The access mode to use with s$connect_vm_region.

Uncached: This data does not appear in the cache.

Write Back: Data is cached for reads and writes. Any read that misses will load the cache line. A write that misses will allocate (read then modify) the cache line. A write that hits will modify the cache line.

Write Thru: Data is cached for reads. Any read that misses will load the cache line. A write that misses will go around the cache. A write that hits will modify the cache line and will also be written thru to the next level.

Write Once: Data is cached for reads and writes. Any read that misses will load the cache line. A write that misses will go around the cache. A write that hits will modify the cache line. The first modification is written thru to the L2 cache. (This is done so that the L2 cache knows what data is modified in the L1 cache.)

#Caches: How many caches the data can appear in at any time.

Access modes 2, 7 and 9 permit read and write access to every page in a vm region. The only difference is the cache policy used.

There is no "right" cache policy.
The performance of an application that uses shared memory is sensitive to the number of caches that the data can be in (one or all), which caches the data can be in (L1 or L2), the frequency of reads versus writes and the ability of the software to withstand the out-of-order nature of weakly ordered or write ordered caching modes. This last point is covered in the next section of this memo.

The number of caches that the data can appear in is an important factor in the performance of an application.

If data can appear in only one cache, then any time the cache line is loaded from memory, the system must ensure that the data is not present in another CPU's cache. If the data is present in another cache and not modified, the system must wait for the data to be invalidated before it can be read from memory. This waiting more than doubles the amount of time that it takes to read data from main memory. If the data is present in another cache and has been modified, the system must wait for the data to be written back to main memory before it can be read. This form of waiting can triple or quadruple the amount of time that it takes to read data from main memory.

If data can appear in all caches, then it can only be modified in one of them. This is accomplished by having each CPU invalidate its own copy when it sees another CPU modifying it. (This is why the "all caches" policy is also "write thru".) Because this invalidation happens in the background and does not require additional bus cycles, there is no additional time required to read data from main memory. Reads are now as efficient as possible; unfortunately, all writes now go all the way to main memory. Whether the reduction in read time offsets the increase in write time depends on the reference patterns of the application.

The protocol used to implement cache consistency is called "snooping" and can account for two-thirds or more of the main bus activity.
In 6-CPU systems an application using large amounts of shared memory can saturate the main bus, which leads to further system slowdown due to queueing effects. Thus, it is vital to select a caching policy that matches the usage patterns of the shared data.

Data that is mostly read and rarely written may be best off in shared nonsequential access mode. This is because unmodified data may appear in all of the caches in the system. Writes in this mode are relatively inefficient, as they must invalidate all the other CPUs' caches, and must be written all the way to main memory.

Data that is more heavily modified may be better off in one of the exclusive access modes. Exclusive nonsequential would normally be preferable, as it makes use of the L1 (on-chip) cache, but its weakly ordered memory characteristics may make it unsuitable for existing applications. Writes in these modes stay in the cache. The drawback is that data can appear in only one cache at a time. This results in long delays and increased bus utilization as data is shuffled around the system.

Experience with several different benchmarks suggests that most data is read much more often than it is written, and so more weight should be given to selecting a policy that optimizes the performance of reads. Experience also suggests that most user applications are insensitive to memory access reordering issues, either by virtue of careful use of locks around accesses to shared data, or absence of sophisticated synchronization algorithms. These considerations suggest that shared nonsequential is probably the best overall cache policy.

By setting up multiple shared memory regions, an application could use different caching policies for different data, as appropriate.

The tables below attempt to summarize the above information. The adjectives used are intended to show broad areas of similar performance.
Because it is difficult to know the referencing patterns of an application and the overall effect of each cache policy on system efficiency, the safest approach is to run benchmarks to determine the best policy or policies to use.

DATA THAT IS LIGHTLY SHARED

+---------------+---+--------------+---------------+-------------+
| Cache Policy  | # | Mostly Reads | Mostly Writes | Typical R/W |
|---------------|---|--------------|---------------|-------------|
| Exclusive     | 2 | Poor         | Good          | Poor        |
| Sequential    |   |              |               |             |
|---------------|---|--------------|---------------|-------------|
| Exclusive     | 7 | Good         | Good          | Good        |
| Nonsequential |   |              |               |             |
|---------------|---|--------------|---------------|-------------|
| Shared        | 9 | Excellent    | Poor          | Good        |
| Nonsequential |   |              |               |             |
+---------------+---+--------------+---------------+-------------+

DATA THAT IS HEAVILY SHARED

+---------------+---+--------------+---------------+-------------+
| Cache Policy  | # | Mostly Reads | Mostly Writes | Typical R/W |
|---------------|---|--------------|---------------|-------------|
| Exclusive     | 2 | Very Poor    | Very Poor     | Very Poor   |
| Sequential    |   |              |               |             |
|---------------|---|--------------|---------------|-------------|
| Exclusive     | 7 | Poor         | Poor          | Poor        |
| Nonsequential |   |              |               |             |
|---------------|---|--------------|---------------|-------------|
| Shared        | 9 | Excellent    | Poor          | Good        |
| Nonsequential |   |              |               |             |
+---------------+---+--------------+---------------+-------------+

2.6. Use locks to protect critical regions rather than relying on memory reference ordering. (Memory reference ordering refers to the order that reads and writes actually occur in a multiprocessor shared memory computer system.)

Why? A side-effect of the cache policies outlined in the previous section is that they each have different memory ordering characteristics. The following table relates cache policies to memory ordering modes.

ACCESS
MODE    CACHE POLICY               MEMORY ORDERING MODE
----    ------------------------   -----------------------
2       exclusive sequential       sequentially consistent
7       exclusive non-sequential   weakly ordered (*)
9       shared non-sequential      write ordered

(*) on Atlantic rev 43 and above, this mode is write ordered.

No matter which cache policy is used, a program running on a given processor always sees a sequentially consistent view of its world; it always reads the data that it has written. These memory ordering modes only affect the view seen by another processor.

Sequentially-consistent means that other processors in an MP system see reads and writes to main memory appear in the traditional order.

Write-ordered means that other processors see writes occur in order, but no guarantee is made about reads versus writes.

Weakly-ordered means that other processors have no guarantee about the order of reads vs. writes, or about the order of some writes vs. other writes. (Other processors DO see multiple writes to the same location in the proper order.)

The easiest way to avoid any dependency on memory order is to use locks. If all references to shared memory use locks, then all memory ordering modes will give identical results. The basic sequence should be:

     * lock the lock that protects the shared data
     * reference the data (reads and/or writes)
     * unlock the lock

Examples of code sequences that are not well-behaved are given in a companion memo.

2.7. Make sure that cache lines are fully utilized. Cache lines are 32 bytes and ALL of this data is read or written back if modified. (Previous Stratus processors only wrote back the modified locations.) Put another way, ensure that as much data as possible fits in each cache line. Rearrange array subscripts, loop nesting, etc., so that data is accessed sequentially in main memory. Rearrange data structures to maximize locality of references. Pack data that is referenced together into as few cache lines as possible.
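
The advice on array subscripts and loop nesting can be illustrated with a small C sketch. It assumes 4-byte int elements, so that each 8-element row fills one 32-byte cache line; the array dimensions and function names are hypothetical. C stores arrays in row-major order, so the last subscript should vary fastest.

```c
#include <stddef.h>

#define ROWS 4
#define COLS 8   /* 8 four-byte ints = one 32-byte cache line (assumed sizes) */

/* Sequential access: the inner loop walks consecutive memory locations,
   so every element of a loaded cache line is used before moving on. */
long sum_row_major(int a[ROWS][COLS])
{
    long sum = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            sum += a[r][c];
    return sum;
}

/* Same result, but the inner loop strides a full row (one cache line)
   between references, so each line is touched once per column. */
long sum_column_major(int a[ROWS][COLS])
{
    long sum = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            sum += a[r][c];
    return sum;
}
```

Both functions compute the same total; only the order of memory references differs. On a large array the first form would be expected to load far fewer cache lines, although the actual difference should be measured.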
Consider allocating large structures on cache line boundaries so that you can control exactly which variables occupy the same cache line.

Why? Loading cache lines, especially cache lines for shared data, is much more expensive than referencing data that is already in the cache. Eliminating excess cache line loads speeds up the software and reduces main bus traffic, which leaves more bus bandwidth available for other processors.

3. NOTES (VOS Only)

3.1. The VOS scheduling algorithms have been changed on multiprocessor XA/R ("Atlantic") systems to optimize the assignment of processes to processors. Loading the processor cache with the "cache working set" of the process is a fairly expensive operation, both for the process and for the system. Since the performance of the executing program is sensitive to whether or not the cache is already loaded, VOS now remembers which CPU a process last ran on, and attempts to run the process on the same CPU the next time it runs.

The internal name for this feature is the "tachy" scheduler, from the prefix "tachy-" that comes from the Greek word "tachus" meaning swift. (It is also a pun on the word tacky for stickiness.)

These changes apply only to multiprocessor XA/R models, which are models 15, 45, 55, 310, 320, 330 and 340. They do not apply to the uniprocessor models, which are models 5, 10, 20, 25, 35, 300 and 305.

The algorithm for VOS 11.5 thru 11.7 is as follows:

When a process gives up the CPU, the CPU number and wall clock time are recorded in its PTE (process table entry). Later, when a CPU is in the scheduler and looking for work to do, it scans the ready-to-run processes in priority order looking for the highest-priority process that meets its scheduling criteria. A candidate process is scheduled if

     (a) it is tachyed to the current CPU, or
     (b) it can tachy to any CPU (*1), or
     (c) it is tachyed to a different CPU but finished running there more than 408 milliseconds ago (*2).

Otherwise the CPU goes idle.

(*1) Processes are in this state if they are newly-created or if they are starting a new scheduling quantum.

(*2) This is called an "offboard timeout". A ready process that waits more than 408 milliseconds without running will be run by the next available CPU. This value may be modified by patching sys_info$user_cache_decay (an 8-byte value that contains the number of jiffies to wait; jiffies are 1/65536th of a second).

As of VOS Release 11.8, all versions of VOS Release 12 and subsequent releases, the algorithm is improved to give preference to running a process on the other CPU of the same board (i.e., on the twin CPU) before moving it to a completely different board. The new algorithm is:

A candidate process is scheduled if

     (a) it is tachyed to the current CPU, or
     (b) it can tachy to any CPU (*1), or
     (c) it is tachyed to the other CPU on the same board but finished running there more than 204 milliseconds ago (*3), or
     (d) it is tachyed to a different CPU but finished running there more than 408 milliseconds ago (*2).

Otherwise the CPU goes idle.

(*3) This is called a "twin timeout". A ready process that waits more than 204 milliseconds without running is eligible to run on the twin CPU. This value may be modified by patching sys_info$twin_cache_timeout (an 8-byte value that contains the number of jiffies to wait; jiffies are 1/65536th of a second).

Note: Both versions of the tachy scheduling algorithm force a process to wait up to 408 milliseconds (in the worst case) to see if the CPU that it last ran on becomes available. When a CPU is looking for a process to run, it will skip processes whose tachy timeouts have not expired. In order to avoid going idle, the CPU will run other ready processes, even lower-priority processes. If no other processes are ready, then the CPU will idle.

Both kernel variables (sys_info$twin_cache_timeout and sys_info$offboard_cache_timeout) are 8-byte integers (PL/I fixed decimal (15)).
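
Rules (a) through (d) of the newer algorithm can be sketched as a predicate in C. This is only an illustration of the rules as stated above, not the VOS implementation. The struct candidate, the NO_CPU marker, the same_board pairing of CPUs 2n and 2n+1, and all function names are assumptions made for the example; only the 1/65536-second jiffy and the timeout rules come from this memo.

```c
#include <stdbool.h>
#include <stdint.h>

#define JIFFIES_PER_SECOND 65536  /* from the memo: a jiffy is 1/65536 second */
#define NO_CPU (-1)               /* hypothetical marker: may tachy to any CPU */

/* Hypothetical view of the two values the memo says are kept in the PTE. */
struct candidate {
    int     last_cpu;          /* CPU the process last ran on, or NO_CPU */
    int64_t last_ran_jiffies;  /* wall clock time when it gave up the CPU */
};

/* Convert milliseconds to jiffies (truncating). */
int64_t ms_to_jiffies(int64_t ms)
{
    return ms * JIFFIES_PER_SECOND / 1000;
}

/* Assumption: CPUs 2n and 2n+1 share a board; the memo does not say. */
static bool same_board(int a, int b)
{
    return a / 2 == b / 2;
}

/* True if this_cpu may run the candidate at time now (in jiffies). */
bool may_schedule(const struct candidate *p, int this_cpu, int64_t now,
                  int64_t twin_timeout, int64_t offboard_timeout)
{
    int64_t waited = now - p->last_ran_jiffies;

    if (p->last_cpu == this_cpu)            /* rule (a) */
        return true;
    if (p->last_cpu == NO_CPU)              /* rule (b) */
        return true;
    if (same_board(p->last_cpu, this_cpu))  /* rule (c): twin timeout */
        return waited > twin_timeout;
    return waited > offboard_timeout;       /* rule (d): offboard timeout */
}
```

A scheduler loop would apply this test to each ready process in priority order and run the first one that passes. With the earlier (11.5 thru 11.7) algorithm, the same sketch applies if the twin timeout is set equal to the offboard timeout.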
The default value of sys_info$twin_cache_timeout is 00000000 00003458, which is approximately 204 milliseconds. The default value of sys_info$offboard_cache_timeout is 00000000 000068B0, which is approximately 408 milliseconds. These values were determined experimentally, by running benchmarks and finding the values that optimized both thruput and response time. Normally there should be no need to change them, but you can experiment with different values if you wish.

As we gained experience with the "twin tachy" scheduler in VOS Release 11.8 and up, we determined that we could further reduce the tachy timeout values. This change improves response times without significantly impacting system efficiency. As of VOS Release 13.3 and subsequent releases, the twin tachy timeout value (sys_info$twin_cache_timeout) is 25 milliseconds (00000000 00000666) and the value of the offboard cache timeout is 100 milliseconds (00000000 0000199A).

If you are running any version of VOS Release 11.8 or VOS Release 12, and you would like the benefit of the VOS 13.3 improvement, you can patch the timeout variables to the values they have in VOS Release 13.3. They can be patched in the partition or online.

Any customer that feels they need further reductions in the timeout values should contact Stratus VOS Engineering for assistance and guidance before proceeding.

3.2. Super Tachy Option

Normally, VOS resets the tachy CPU binding (the memory of which CPU a process last ran on) at the end of a scheduling quantum. The default scheduling quanta can be displayed via the display_scheduler_info command. Each line of output is a different quantum. Since a quantum is a few hundred milliseconds to a few seconds, a process is bound to a CPU for a substantial period of time. By occasionally resetting the tachy binding, VOS forces the system to rebalance the workload and reduces the impact of unbalanced workloads.
(An unbalanced workload is one where too many processes are tachyed to one CPU and not enough are tachyed to another.) The only other way for a process to get a new tachy CPU binding is for a tachy timeout to occur.

Normally these two mechanisms work well to provide a well-balanced load. However, if an application process is a heavy user of s$sleep, and is sleeping for short periods of time (less than the tachy timeout intervals), then there can be problems. The reason is that s$sleep resets your quantum type to the first quantum type; by so doing it also resets the tachy binding. If the sleep was for a few seconds or minutes it is not a big problem (the cache data would be long gone anyway), but if the sleep is for a few tens of milliseconds, then the process will run inefficiently, since it will likely run on a different CPU each time, taking no advantage of the cache.

The solution is to patch the kernel variable sys_info$super_tachy from 0 to 1. This is a 2-byte integer variable (PL/I fixed bin (15)) that controls whether the tachy binding is reset at the end of a scheduling quantum. By default (0; super tachy disabled) VOS resets the tachy binding; any nonzero value prevents the reset. This value can be patched in the partition or online. The new value takes effect immediately.

If super tachy is enabled, then the only way for a process to get a new tachy CPU binding is for it to time out while ready to run.

(end)