LLVM-MCA(1)                          LLVM                          LLVM-MCA(1)

NAME
       llvm-mca - LLVM Machine Code Analyzer

SYNOPSIS
       llvm-mca [options] [input]

DESCRIPTION
       llvm-mca is a performance analysis tool that uses information available
       in LLVM (e.g. scheduling models) to statically measure the performance
       of machine code on a specific CPU.

       Performance is measured in terms of throughput as well as processor re-
       source consumption. The tool currently works  for  processors  with  an
       out-of-order  backend,  for which there is a scheduling model available
       in LLVM.

       The main goal of this tool is not just to predict the performance of
       the code when run on the target, but also to help diagnose potential
       performance issues.

       Given an assembly code sequence, llvm-mca  estimates  the  Instructions
       Per  Cycle  (IPC),  as well as hardware resource pressure. The analysis
       and reporting style were inspired by the IACA tool from Intel.

       For example, you can compile code with clang, output assembly, and pipe
       it directly into llvm-mca for analysis:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -S -o - | llvm-mca -mcpu=btver2

       Or for Intel syntax:

          $ clang foo.c -O2 -target x86_64-unknown-unknown -mllvm -x86-asm-syntax=intel -S -o - | llvm-mca -mcpu=btver2

OPTIONS
       If  input is "-" or omitted, llvm-mca reads from standard input. Other-
       wise, it will read from the specified filename.

       If the -o option is omitted, then llvm-mca  will  send  its  output  to
       standard  output if the input is from standard input.  If the -o option
       specifies "-", then the output will also be sent to standard output.

       -help  Print a summary of command line options.

       -mtriple=<target triple>
              Specify a target triple string.

       -march=<arch>
              Specify the architecture for which to analyze the code.  It  de-
              faults to the host default target.

       -mcpu=<cpuname>
              Specify  the  processor  for  which to analyze the code.  By de-
              fault, the cpu name is autodetected from the host.

       -output-asm-variant=<variant id>
              Specify the output assembly variant for the report generated by
              the tool. On x86, possible values are [0, 1]. A value of 0
              selects the AT&T assembly format, and a value of 1 selects the
              Intel assembly format, for the code printed out by the tool in
              the analysis report.

       -dispatch=<width>
              Specify a different dispatch width for the processor.  The  dis-
              patch  width  defaults  to  field  'IssueWidth' in the processor
              scheduling model.  If width is zero, then the  default  dispatch
              width is used.

       -register-file-size=<size>
              Specify the size of the register file. When specified, this flag
              limits how many physical registers are  available  for  register
              renaming  purposes.  A value of zero for this flag means "unlim-
              ited number of physical registers".

       -iterations=<number of iterations>
              Specify the number of iterations to run. If this flag is set  to
              0,  then  the  tool  sets  the number of iterations to a default
              value (i.e. 100).

       -noalias=<bool>
              If set, the tool assumes that loads and stores don't alias. This
              is the default behavior.

       -lqueue=<load queue size>
              Specify the size of the load queue in the load/store unit
              emulated by the tool. By default, the tool assumes an unbounded
              number of entries in the load queue. A value of zero for this
              flag is ignored, and the default load queue size is used
              instead.

       -squeue=<store queue size>
              Specify the size of the store queue in the load/store unit
              emulated by the tool. By default, the tool assumes an unbounded
              number of entries in the store queue. A value of zero for this
              flag is ignored, and the default store queue size is used
              instead.

       -timeline
              Enable the timeline view.

       -timeline-max-iterations=<iterations>
              Limit the number of iterations to print in the timeline view. By
              default, the timeline view prints information for up to 10 iter-
              ations.

       -timeline-max-cycles=<cycles>
              Limit the number of cycles in the timeline view. By default, the
              number of cycles is set to 80.

       -resource-pressure
              Enable the resource pressure view. This is enabled by default.

       -register-file-stats
              Enable register file usage statistics.

       -dispatch-stats
              Enable extra dispatch statistics. This view  collects  and  ana-
              lyzes  instruction  dispatch  events,  as well as static/dynamic
              dispatch stall events. This view is disabled by default.

       -scheduler-stats
              Enable extra scheduler statistics. This view collects  and  ana-
              lyzes  instruction  issue  events.  This view is disabled by de-
              fault.

       -retire-stats
              Enable extra retire control unit statistics. This view  is  dis-
              abled by default.

       -instruction-info
              Enable the instruction info view. This is enabled by default.

       -all-stats
              Print all hardware statistics. This enables extra statistics re-
              lated to the dispatch logic, the hardware schedulers, the regis-
              ter  file(s),  and  the retire control unit. This option is dis-
              abled by default.

       -all-views
              Enable all the views.

       -instruction-tables
              Prints resource pressure information based on the static  infor-
              mation available from the processor model. This differs from the
              resource pressure view because it doesn't require that the  code
              is  simulated. It instead prints the theoretical uniform distri-
              bution of resource pressure for every instruction in sequence.

EXIT STATUS
       llvm-mca returns 0 on success. Otherwise, an error message  is  printed
       to standard error, and the tool returns 1.

USING MARKERS TO ANALYZE SPECIFIC CODE BLOCKS
       llvm-mca allows for the optional usage of special code comments to mark
       regions of the assembly code to be analyzed.  A comment  starting  with
       substring  LLVM-MCA-BEGIN  marks the beginning of a code region. A com-
       ment starting with substring LLVM-MCA-END marks the end of a  code  re-
       gion.  For example:

          # LLVM-MCA-BEGIN My Code Region
            ...
          # LLVM-MCA-END

       Multiple regions can be specified provided that they do not overlap.  A
       code region can have an optional description. If no user-defined region
       is specified, then llvm-mca assumes a default region which contains ev-
       ery instruction in the input file.  Every region is analyzed in  isola-
       tion,  and the final performance report is the union of all the reports
       generated for every code region.

       Inline assembly directives may be used from source code to annotate the
       assembly text:

          int foo(int a, int b) {
            __asm volatile("# LLVM-MCA-BEGIN foo");
            a += 42;
            __asm volatile("# LLVM-MCA-END");
            a *= b;
            return a;
          }
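
       The region rules above can be illustrated with a small Python sketch.
       It is not llvm-mca's parser: the function name split_regions, the
       assumption that "#" starts a comment, the choice to reject nested
       markers, and the sample instructions are all hypothetical and serve
       only to show how markers delimit regions and how the default region
       arises.

```python
def split_regions(asm_text, comment_char="#"):
    """Return a list of (description, [instruction lines]) pairs."""
    regions, current, everything = [], None, []
    saw_marker = False
    for raw in asm_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(comment_char):
            comment = line[len(comment_char):].strip()
            if comment.startswith("LLVM-MCA-BEGIN"):
                if current is not None:
                    raise ValueError("regions must not overlap")
                saw_marker = True
                # Whatever follows the marker is the optional description.
                current = (comment[len("LLVM-MCA-BEGIN"):].strip(), [])
            elif comment.startswith("LLVM-MCA-END"):
                if current is None:
                    raise ValueError("LLVM-MCA-END without LLVM-MCA-BEGIN")
                regions.append(current)
                current = None
            continue
        everything.append(line)
        if current is not None:
            current[1].append(line)
    # Without any marker, fall back to one default region holding everything.
    return regions if saw_marker else [("", everything)]


# Hypothetical assembly text, loosely modeled on the foo() example above.
example = """\
# LLVM-MCA-BEGIN foo
addl $42, %edi
# LLVM-MCA-END
imull %esi, %edi
movl %edi, %eax
retq
"""
print(split_regions(example))  # [('foo', ['addl $42, %edi'])]
```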

HOW LLVM-MCA WORKS
       llvm-mca takes assembly code as input. The assembly code is parsed into
       a sequence of MCInst with the help of the existing LLVM target assembly
       parsers.  The  parsed sequence of MCInst is then analyzed by a Pipeline
       module to generate a performance report.

       The Pipeline module simulates the execution of  the  machine  code  se-
       quence  in  a loop of iterations (default is 100). During this process,
       the pipeline collects a number of execution related statistics. At  the
       end  of  this  process, the pipeline generates and prints a report from
       the collected statistics.

       Here is an example of a performance report generated by the tool for a
       dot-product of two packed float vectors of four elements. The analysis
       is conducted for target x86, cpu btver2. The report can be reproduced
       with the following command, using the example located at
       test/tools/llvm-mca/X86/BtVer2/dot-product.s:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

          Iterations:        300
          Instructions:      900
          Total Cycles:      610
          Total uOps:        900

          Dispatch Width:    2
          uOps Per Cycle:    1.48
          IPC:               1.48
          Block RThroughput: 2.0

          Instruction Info:
          [1]: #uOps
          [2]: Latency
          [3]: RThroughput
          [4]: MayLoad
          [5]: MayStore
          [6]: HasSideEffects (U)

          [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
           1      2     1.00                        vmulps      %xmm0, %xmm1, %xmm2
           1      3     1.00                        vhaddps     %xmm2, %xmm2, %xmm3
           1      3     1.00                        vhaddps     %xmm3, %xmm3, %xmm4

          Resources:
          [0]   - JALU0
          [1]   - JALU1
          [2]   - JDiv
          [3]   - JFPA
          [4]   - JFPM
          [5]   - JFPU0
          [6]   - JFPU1
          [7]   - JLAGU
          [8]   - JMul
          [9]   - JSAGU
          [10]  - JSTC
          [11]  - JVALU0
          [12]  - JVALU1
          [13]  - JVIMUL

          Resource pressure per iteration:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
           -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

          Resource pressure by instruction:
          [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
           -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps      %xmm0, %xmm1, %xmm2
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm2, %xmm2, %xmm3
           -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps     %xmm3, %xmm3, %xmm4

       According to this report, the dot-product kernel has been executed  300
       times,  for  a total of 900 simulated instructions. The total number of
       simulated micro opcodes (uOps) is also 900.

       The report is structured in three main  sections.   The  first  section
       collects a few performance numbers; the goal of this section is to give
       a very quick overview of the performance throughput. Important  perfor-
       mance indicators are IPC, uOps Per Cycle, and  Block RThroughput (Block
       Reciprocal Throughput).

       IPC is computed by dividing the total number of simulated instructions
       by the total number of cycles. In the absence of loop-carried data
       dependencies, the observed IPC tends to a theoretical maximum which can
       be computed by dividing the number of instructions of a single
       iteration by the Block RThroughput.

       Field 'uOps Per Cycle' is computed by dividing the total number of
       simulated micro opcodes by the total number of cycles. A delta between
       Dispatch Width and this field is an indicator of a performance issue.
       In the absence of loop-carried data dependencies, the observed 'uOps
       Per Cycle' should tend to a theoretical maximum throughput which can be
       computed by dividing the number of uOps of a single iteration by the
       Block RThroughput.

       Field 'uOps Per Cycle' is bounded from above by the dispatch width.
       That is because the dispatch width limits the maximum size of a
       dispatch group. Both IPC and 'uOps Per Cycle' are limited by the amount
       of hardware parallelism. The availability of hardware resources affects
       the resource pressure distribution, and it limits the number of
       instructions that can be executed in parallel every cycle. A delta
       between Dispatch Width and the theoretical maximum uOps per Cycle
       (computed by dividing the number of uOps of a single iteration by the
       Block RThroughput) is an indicator of a performance bottleneck caused
       by the lack of hardware resources. In general, the lower the Block
       RThroughput, the better.

       In this example, uOps per iteration/Block RThroughput  is  1.50.  Since
       there  are no loop-carried dependencies, the observed uOps Per Cycle is
       expected to approach 1.50 when the number of iterations tends to infin-
       ity.  The  delta between the Dispatch Width (2.00), and the theoretical
       maximum uOp throughput (1.50) is an indicator of a performance  bottle-
       neck  caused  by the lack of hardware resources, and the Resource pres-
       sure view can help to identify the problematic resource usage.
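
       The arithmetic behind these indicators can be checked with a few lines
       of Python. This is only a worked calculation on the numbers printed in
       the report above; the variable names are illustrative and are not part
       of llvm-mca.

```python
# Values copied from the summary section of the report above.
iterations = 300
instructions = 900
total_cycles = 610
total_uops = 900
dispatch_width = 2
block_rthroughput = 2.0

# Observed throughput over the whole simulation.
ipc = instructions / total_cycles
uops_per_cycle = total_uops / total_cycles

# Theoretical maximum, from a single iteration and the Block RThroughput.
max_ipc = (instructions / iterations) / block_rthroughput
max_uops_per_cycle = (total_uops / iterations) / block_rthroughput

print(f"IPC:                {ipc:.2f}")                  # 1.48
print(f"uOps Per Cycle:     {uops_per_cycle:.2f}")       # 1.48
print(f"Max IPC:            {max_ipc:.2f}")              # 1.50
print(f"Max uOps Per Cycle: {max_uops_per_cycle:.2f}")   # 1.50
print(f"Delta to width:     {dispatch_width - max_uops_per_cycle:.2f}")  # 0.50
```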

       The second section of the  report  shows  the  latency  and  reciprocal
       throughput  of every instruction in the sequence. That section also re-
       ports extra information related to the number of micro opcodes, and op-
       code properties (i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').

       The third section is the Resource pressure view.  This view reports the
       average number of resource cycles consumed every iteration by  instruc-
       tions  for  every processor resource unit available on the target.  In-
       formation is structured in two tables. The first table reports the num-
       ber of resource cycles spent on average every iteration. The second ta-
       ble correlates the resource cycles to the machine  instruction  in  the
       sequence. For example, every iteration of the instruction vmulps always
       executes on resource unit [6] (JFPU1 -  floating  point  pipeline  #1),
       consuming  an  average of 1 resource cycle per iteration.  Note that on
       AMD Jaguar, vector floating-point multiply can only be issued to  pipe-
       line  JFPU1,  while horizontal floating-point additions can only be is-
       sued to pipeline JFPU0.

       The resource pressure view helps with identifying bottlenecks caused by
       high  usage  of  specific hardware resources.  Situations with resource
       pressure mainly concentrated on a few resources should, in general,  be
       avoided.   Ideally,  pressure  should  be uniformly distributed between
       multiple resources.
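
       The relationship between the two resource pressure tables can be
       sketched in Python: summing the per-instruction rows gives the
       per-iteration row. The data below is transcribed from the report above;
       the variable names are illustrative only.

```python
from collections import defaultdict

# Rows of the "Resource pressure by instruction" table above; cells shown
# as "-" are simply omitted.
by_instruction = [
    ("vmulps  %xmm0, %xmm1, %xmm2", {"JFPM": 1.00, "JFPU1": 1.00}),
    ("vhaddps %xmm2, %xmm2, %xmm3", {"JFPA": 1.00, "JFPU0": 1.00}),
    ("vhaddps %xmm3, %xmm3, %xmm4", {"JFPA": 1.00, "JFPU0": 1.00}),
]

# Sum the resource cycles of every instruction to get the per-iteration row.
per_iteration = defaultdict(float)
for _, cycles in by_instruction:
    for resource, value in cycles.items():
        per_iteration[resource] += value

# The most heavily used resources are the likely bottleneck candidates.
peak = max(per_iteration.values())
busiest = sorted(r for r, v in per_iteration.items() if v == peak)

print(dict(sorted(per_iteration.items())))
# {'JFPA': 2.0, 'JFPM': 1.0, 'JFPU0': 2.0, 'JFPU1': 1.0}
print(busiest, peak)  # ['JFPA', 'JFPU0'] 2.0
```

       In this example the peak pressure of 2.00 cycles (on JFPA and JFPU0)
       has the same value as the Block RThroughput reported in the summary.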

   Timeline View
       The timeline view produces a  detailed  report  of  each  instruction's
       state  transitions  through  an instruction pipeline.  This view is en-
       abled by the command line option -timeline.  As instructions transition
       through  the  various stages of the pipeline, their states are depicted
       in the view report.  These states  are  represented  by  the  following
       characters:

       o D : Instruction dispatched.

       o e : Instruction executing.

       o E : Instruction executed.

       o R : Instruction retired.

       o = : Instruction already dispatched, waiting to be executed.

       o - : Instruction executed, waiting to be retired.

       Below  is the timeline view for a subset of the dot-product example lo-
       cated in test/tools/llvm-mca/X86/BtVer2/dot-product.s and processed  by
       llvm-mca using the following command:

          $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

          Timeline view:
                              012345
          Index     0123456789

          [0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
          [0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
          [0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
          [1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
          [1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
          [1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
          [2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
          [2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
          [2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4

          Average Wait times (based on the timeline view):
          [0]: Executions
          [1]: Average time spent waiting in a scheduler's queue
          [2]: Average time spent waiting in a scheduler's queue while ready
          [3]: Average time elapsed from WB until retire stage

                [0]    [1]    [2]    [3]
          0.     3     1.0    1.0    3.3       vmulps   %xmm0, %xmm1, %xmm2
          1.     3     3.3    0.7    1.0       vhaddps  %xmm2, %xmm2, %xmm3
          2.     3     5.7    0.0    0.0       vhaddps  %xmm3, %xmm3, %xmm4

       The  timeline  view  is  interesting because it shows instruction state
       changes during execution.  It also gives an idea of how the  tool  pro-
       cesses instructions executed on the target, and how their timing infor-
       mation might be calculated.

       The timeline view is structured in two tables.  The first  table  shows
       instructions  changing state over time (measured in cycles); the second
       table (named Average Wait  times)  reports  useful  timing  statistics,
       which  should help diagnose performance bottlenecks caused by long data
       dependencies and sub-optimal usage of hardware resources.

       An instruction in the timeline view is identified by a pair of indices,
       where the first index identifies an iteration, and the second index is
       the instruction index (i.e., where it appears in the code sequence).
       Since this example was generated using 3 iterations (-iterations=3),
       the iteration indices range from 0 to 2 inclusive.

       Excluding the first and last column, the remaining columns are  in  cy-
       cles.  Cycles are numbered sequentially starting from 0.

       From the example output above, we know the following:

       o Instruction [1,0] was dispatched at cycle 1.

       o Instruction [1,0] started executing at cycle 2.

       o Instruction [1,0] reached the write back stage at cycle 4.

       o Instruction [1,0] was retired at cycle 10.
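
       Reading a timeline row by hand amounts to finding the column of each
       state character. A minimal Python sketch, assuming the row string
       starts at cycle 0 and contains at least one 'e' (the helper name
       timeline_events is hypothetical):

```python
def timeline_events(row):
    """Map a timeline row such as '.DeeE-----R' to the cycle of each event."""
    return {
        "dispatched": row.index("D"),
        "execution_started": row.index("e"),  # first cycle spent executing
        "write_back": row.index("E"),
        "retired": row.index("R"),
    }


# Row [1,0] from the timeline view above.
print(timeline_events(".DeeE-----R"))
# {'dispatched': 1, 'execution_started': 2, 'write_back': 4, 'retired': 10}
```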

       Instruction  [1,0]  (i.e.,  vmulps  from iteration #1) does not have to
       wait in the scheduler's queue for the operands to become available.  By
       the  time  vmulps  is  dispatched,  operands are already available, and
       pipeline JFPU1 is ready to serve another instruction.  So the  instruc-
       tion  can  be  immediately issued on the JFPU1 pipeline. That is demon-
       strated by the fact that the instruction only spent 1cy in  the  sched-
       uler's queue.

       There  is a gap of 5 cycles between the write-back stage and the retire
       event.  That is because instructions must retire in program  order,  so
       [1,0]  has  to wait for [0,2] to be retired first (i.e., it has to wait
       until cycle 10).

       In the example, all instructions are in a RAW (Read After Write) depen-
       dency  chain.   Register %xmm2 written by vmulps is immediately used by
       the first vhaddps, and register %xmm3 written by the first  vhaddps  is
       used  by  the second vhaddps.  Long data dependencies negatively impact
       the ILP (Instruction Level Parallelism).

       In the dot-product example, there are anti-dependencies  introduced  by
       instructions  from  different  iterations.  However, those dependencies
       can be removed at register renaming stage (at the  cost  of  allocating
       register aliases, and therefore consuming physical registers).

       Table  Average  Wait  times  helps diagnose performance issues that are
       caused by the presence of long  latency  instructions  and  potentially
       long data dependencies which may limit the ILP.  Note that llvm-mca, by
       default, assumes at least 1cy between the dispatch event and the  issue
       event.

       When  the  performance  is limited by data dependencies and/or long la-
       tency instructions, the number of cycles spent while in the ready state
       is expected to be very small when compared with the total number of cy-
       cles spent in the scheduler's queue.  The difference  between  the  two
       counters  is  a good indicator of how large of an impact data dependen-
       cies had on the execution of the  instructions.   When  performance  is
       mostly limited by the lack of hardware resources, the delta between the
       two counters is small.  However, the number  of  cycles  spent  in  the
       queue  tends to be larger (i.e., more than 1-3cy), especially when com-
       pared to other low latency instructions.
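
       Columns [1] and [3] of the Average Wait times table can be recomputed
       from the timeline rows. The sketch below assumes that the time spent in
       the scheduler's queue is the number of '=' cycles plus the one cycle
       that llvm-mca assumes between dispatch and issue, and that the time
       from write-back to retire is the number of '-' cycles. That reading is
       an assumption, although it agrees with the table above. Column [2]
       cannot be derived from the timeline characters alone.

```python
# Timeline rows from the example above, grouped by instruction index.
rows = {
    "vmulps  %xmm0, %xmm1, %xmm2": ["DeeER", ".DeeE-----R", ".  DeeE-----R"],
    "vhaddps %xmm2, %xmm2, %xmm3": ["D==eeeER", ". D=eeeE---R", ".  D====eeeER"],
    "vhaddps %xmm3, %xmm3, %xmm4": [".D====eeeER", ". D====eeeER",
                                    ".   D======eeeER"],
}


def average_waits(timeline_rows):
    """Return (average queue time, average write-back-to-retire time)."""
    n = len(timeline_rows)
    # '=' marks a cycle spent waiting; add 1cy for the dispatch-to-issue gap.
    queue = sum(row.count("=") + 1 for row in timeline_rows) / n
    # '-' marks a cycle spent executed but not yet retired.
    retire = sum(row.count("-") for row in timeline_rows) / n
    return queue, retire


for name, timeline_rows in rows.items():
    queue, retire = average_waits(timeline_rows)
    print(f"{queue:.1f}  {retire:.1f}  {name}")
# 1.0  3.3  vmulps  %xmm0, %xmm1, %xmm2
# 3.3  1.0  vhaddps %xmm2, %xmm2, %xmm3
# 5.7  0.0  vhaddps %xmm3, %xmm3, %xmm4
```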

   Extra Statistics to Further Diagnose Performance Issues
       The -all-stats command line option enables extra statistics and perfor-
       mance  counters  for the dispatch logic, the reorder buffer, the retire
       control unit, and the register file.

       Below is an example of -all-stats output generated by  llvm-mca for 300
       iterations  of  the  dot-product example discussed in the previous sec-
       tions.

          Dynamic Dispatch Stall Cycles:
          RAT     - Register unavailable:                      0
          RCU     - Retire tokens unavailable:                 0
          SCHEDQ  - Scheduler full:                            272  (44.6%)
          LQ      - Load queue full:                           0
          SQ      - Store queue full:                          0
          GROUP   - Static restrictions on the dispatch group: 0

          Dispatch Logic - number of cycles where we saw N micro opcodes dispatched:
          [# dispatched], [# cycles]
           0,              24  (3.9%)
           1,              272  (44.6%)
           2,              314  (51.5%)

          Schedulers - number of cycles where we saw N instructions issued:
          [# issued], [# cycles]
           0,          7  (1.1%)
           1,          306  (50.2%)
           2,          297  (48.7%)

          Scheduler's queue usage:
          [1] Resource name.
          [2] Average number of used buffer entries.
          [3] Maximum number of used buffer entries.
          [4] Total number of buffer entries.

           [1]            [2]        [3]        [4]
          JALU01           0          0          20
          JFPU01           17         18         18
          JLSAGU           0          0          12

          Retire Control Unit - number of cycles where we saw N instructions retired:
          [# retired], [# cycles]
           0,           109  (17.9%)
           1,           102  (16.7%)
           2,           399  (65.4%)

          Total ROB Entries:                64
          Max Used ROB Entries:             35  ( 54.7% )
          Average Used ROB Entries per cy:  32  ( 50.0% )

          Register File statistics:
          Total number of mappings created:    900
          Max number of mappings used:         35

          *  Register File #1 -- JFpuPRF:
             Number of physical registers:     72
             Total number of mappings created: 900
             Max number of mappings used:      35

          *  Register File #2 -- JIntegerPRF:
             Number of physical registers:     64
             Total number of mappings created: 0
             Max number of mappings used:      0

       If we look at the Dynamic Dispatch  Stall  Cycles  table,  we  see  the
       counter for SCHEDQ reports 272 cycles.  This counter is incremented ev-
       ery time the dispatch logic is unable to dispatch a full group  because
       the scheduler's queue is full.

       Looking at the Dispatch Logic table, we see that the pipeline was only
       able to dispatch two micro opcodes 51.5% of the time. The dispatch
       group was limited to one micro opcode 44.6% of the cycles, which
       corresponds to 272 cycles. The dispatch statistics are displayed by
       using either the command option -all-stats or -dispatch-stats.

       The next table, Schedulers, presents a histogram of the number of
       instructions issued per cycle. In this case, of the 610 simulated
       cycles, a single instruction was issued in 306 cycles (50.2%), and
       there were 7 cycles where no instructions were issued.

       The Scheduler's queue usage table shows the average and maximum number
       of buffer entries (i.e., scheduler queue entries) used at runtime.
       Resource JFPU01 reached its maximum (18 of 18 queue entries). Note that
       AMD Jaguar implements three schedulers:

       o JALU01 - A scheduler for ALU instructions.

       o JFPU01 - A scheduler for floating point operations.

       o JLSAGU - A scheduler for address generation.

       The dot-product is a kernel of three  floating  point  instructions  (a
       vector  multiply  followed  by two horizontal adds).  That explains why
       only the floating point scheduler appears to be used.

       A full scheduler queue is either caused by data dependency chains or by
       a  sub-optimal  usage of hardware resources.  Sometimes, resource pres-
       sure can be mitigated by rewriting the kernel using different  instruc-
       tions  that  consume  different scheduler resources.  Schedulers with a
       small queue are less resilient to bottlenecks caused by the presence of
       long  data dependencies.  The scheduler statistics are displayed by us-
       ing the command option -all-stats or -scheduler-stats.

       The next table, Retire Control Unit, presents a histogram of the number
       of instructions retired per cycle. In this case, of the 610 simulated
       cycles, two instructions were retired during the same cycle 399 times
       (65.4%), and there were 109 cycles where no instructions were retired.
       The retire statistics are displayed by using the command option
       -all-stats or -retire-stats.
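
       The Dispatch Logic, Schedulers, and Retire Control Unit histograms can
       be cross-checked against the summary numbers: in each one the cycle
       counts add up to the 610 simulated cycles, and the weighted sum gives
       the 900 micro opcodes or instructions processed. A short Python check
       on the values printed above (the helper name summarize is
       illustrative):

```python
def summarize(histogram):
    """histogram maps N (events in a cycle) to the number of such cycles."""
    total_cycles = sum(histogram.values())
    total_events = sum(n * cycles for n, cycles in histogram.items())
    shares = {n: round(100.0 * cycles / total_cycles, 1)
              for n, cycles in histogram.items()}
    return total_cycles, total_events, shares


# Values copied from the -all-stats output above.
dispatch = {0: 24, 1: 272, 2: 314}
issue = {0: 7, 1: 306, 2: 297}
retire = {0: 109, 1: 102, 2: 399}

for name, histogram in (("dispatch", dispatch), ("issue", issue),
                        ("retire", retire)):
    print(name, summarize(histogram))
# dispatch (610, 900, {0: 3.9, 1: 44.6, 2: 51.5})
# issue (610, 900, {0: 1.1, 1: 50.2, 2: 48.7})
# retire (610, 900, {0: 17.9, 1: 16.7, 2: 65.4})
```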

       The last table presented is Register File statistics. Each physical
       register file (PRF) used by the pipeline is presented in this table.
       In the case of AMD Jaguar, there are two register files, one for
       floating-point registers (JFpuPRF) and one for integer registers
       (JIntegerPRF). The table shows that of the 900 instructions processed,
       there were 900 mappings created. Since this dot-product example
       utilized only floating point registers, the JFpuPRF was responsible for
       creating the 900 mappings. However, we see that the pipeline only used
       a maximum of 35 of 72 available register slots at any given time. We
       can conclude that the floating point PRF was the only register file
       used for the example, and that it was never resource constrained. The
       register file statistics are displayed by using the command option
       -all-stats or -register-file-stats.

       In this example, we can conclude that the IPC is mostly limited by data
       dependencies, and not by resource pressure.

   Instruction Flow
       This section describes the instruction flow through the  default  pipe-
       line  of  llvm-mca,  as  well  as  the functional units involved in the
       process.

       The default pipeline implements the following sequence of  stages  used
       to process instructions.

       o Dispatch (Instruction is dispatched to the schedulers).

       o Issue (Instruction is issued to the processor pipelines).

       o Write Back (Instruction is executed, and results are written back).

       o Retire  (Instruction  is  retired; writes are architecturally commit-
         ted).

       The default pipeline only models the out-of-order portion of a
       processor. Therefore, the instruction fetch and decode stages are not
       modeled. Performance bottlenecks in the frontend are not diagnosed.
       llvm-mca assumes that instructions have all been decoded and placed
       into a queue before the simulation starts. Also, llvm-mca does not
       model branch prediction.

   Instruction Dispatch
       During  the  dispatch  stage,  instructions are picked in program order
       from a queue of already decoded instructions, and dispatched in  groups
       to the simulated hardware schedulers.

       The  size  of a dispatch group depends on the availability of the simu-
       lated hardware resources.  The processor dispatch width defaults to the
       value of the IssueWidth in LLVM's scheduling model.

       An instruction can be dispatched if:

       o The size of the dispatch group is smaller than the processor's
         dispatch width.

       o There are enough entries in the reorder buffer.

       o There are enough physical registers to do register renaming.

       o The schedulers are not full.

       Scheduling models can  optionally  specify  which  register  files  are
       available  on the processor. llvm-mca uses that information to initial-
       ize register file descriptors.  Users can limit the number of  physical
       registers  that  are  globally available for register renaming by using
       the command option -register-file-size.  A value of zero for  this  op-
       tion  means  unbounded. By knowing how many registers are available for
       renaming, the tool can predict dispatch stalls caused by  the  lack  of
       physical registers.

       The number of reorder buffer entries consumed by an instruction depends
       on the number of micro-opcodes specified for that  instruction  by  the
       target  scheduling model.  The reorder buffer is responsible for track-
       ing the progress of instructions that  are  "in-flight",  and  retiring
       them in program order.  The number of entries in the reorder buffer de-
       faults to the value specified by field MicroOpBufferSize in the  target
       scheduling model.

       Instructions  that  are  dispatched to the schedulers consume scheduler
       buffer entries. llvm-mca queries the scheduling model to determine  the
       set  of  buffered  resources  consumed by an instruction.  Buffered re-
       sources are treated like scheduler resources.
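
       The four dispatch conditions listed in this section can be written as a
       single predicate. The Python sketch below is a simplification under
       stated assumptions, not llvm-mca's implementation: DispatchState,
       can_dispatch, and registers_needed are hypothetical names, all register
       files are folded into one counter, and only one scheduler buffer is
       considered.

```python
from dataclasses import dataclass


@dataclass
class DispatchState:
    dispatch_width: int            # defaults to IssueWidth in the model
    group_size: int                # uOps already in the current group
    free_rob_entries: int          # reorder buffer entries still available
    free_physical_registers: int   # registers left for renaming
    free_scheduler_entries: int    # buffer entries left in the scheduler


def can_dispatch(state, num_uops, registers_needed):
    """Apply the four dispatch conditions to one instruction."""
    return (
        state.group_size < state.dispatch_width
        # An instruction consumes one ROB entry per micro opcode.
        and state.free_rob_entries >= num_uops
        and state.free_physical_registers >= registers_needed
        and state.free_scheduler_entries > 0
    )


# Sizes taken from the btver2 report earlier in this page: dispatch width 2,
# 64 ROB entries, 72 JFpuPRF registers, 18 JFPU01 scheduler entries.
idle = DispatchState(2, 0, 64, 72, 18)
full_queue = DispatchState(2, 0, 29, 37, 0)   # occupancy values are made up

print(can_dispatch(idle, num_uops=1, registers_needed=1))        # True
print(can_dispatch(full_queue, num_uops=1, registers_needed=1))  # False
```

       The second call corresponds to the situation counted by the SCHEDQ
       dispatch stall: everything else is available, but the scheduler's
       queue is full.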

   Instruction Issue
       Each processor scheduler implements a buffer of instructions. An
       instruction has to wait in the scheduler's buffer until input register
       operands become available. Only at that point does the instruction
       become eligible for execution and may be issued (potentially
       out-of-order) for execution. Instruction latencies are computed by
       llvm-mca with the help of the scheduling model.

       llvm-mca's  scheduler is designed to simulate multiple processor sched-
       ulers.  The scheduler is responsible for  tracking  data  dependencies,
       and dynamically selecting which processor resources are consumed by in-
       structions.  It delegates the management of  processor  resource  units
       and resource groups to a resource manager.  The resource manager is re-
       sponsible for selecting resource units that are  consumed  by  instruc-
       tions.   For  example,  if  an  instruction  consumes 1cy of a resource
       group, the resource manager selects one of the available units from the
       group;  by default, the resource manager uses a round-robin selector to
       guarantee that resource usage  is  uniformly  distributed  between  all
       units of a group.
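
       The round-robin selection described above can be sketched as follows.
       This is a minimal Python illustration, not llvm-mca's resource manager:
       the class ResourceGroup and the unit names used with it are hypotheti-
       cal, and the sketch assumes a unit is either fully busy or fully free.

```python
class ResourceGroup:
    """Toy resource group that hands out its units in round-robin order."""

    def __init__(self, units):
        self.units = list(units)
        self.busy = set()         # units currently consumed
        self.next_index = 0       # where the next round-robin scan starts

    def select(self):
        """Return the next available unit, or None if every unit is busy."""
        count = len(self.units)
        for offset in range(count):
            index = (self.next_index + offset) % count
            unit = self.units[index]
            if unit not in self.busy:
                self.busy.add(unit)
                # Start the next scan just past the unit that was picked.
                self.next_index = (index + 1) % count
                return unit
        return None

    def release(self, unit):
        """Mark a unit as available again."""
        self.busy.discard(unit)
```

       Because each scan starts just past the previously selected unit, a unit
       that was released early is not reused until the other units of the
       group have had their turn.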

       llvm-mca's scheduler internally groups instructions into three sets:

       o WaitSet: a set of instructions whose operands are not ready.

       o ReadySet: a set of instructions ready to execute.

       o IssuedSet: a set of instructions executing.

       Depending on operand availability, instructions that are dispatched
       to the scheduler are placed either into the WaitSet or into the
       ReadySet.

       Every cycle, the scheduler checks if instructions can be moved from the
       WaitSet to the ReadySet, and if instructions from the ReadySet  can  be
       issued to the underlying pipelines. The algorithm prioritizes older in-
       structions over younger instructions.
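
       The per-cycle movement between the three sets can be sketched in Python
       as shown below.  This is a simplified model under stated assumptions:
       ToyScheduler, its issue_width parameter (a stand-in for pipeline avail-
       ability) and the use of instruction names as dependency tokens are all
       hypothetical and do not reflect llvm-mca's actual data structures.

```python
class ToyScheduler:
    """Toy model of the WaitSet / ReadySet / IssuedSet bookkeeping."""

    def __init__(self, issue_width):
        self.issue_width = issue_width   # assumed limit on issues per cycle
        self.next_seq = 0                # program-order sequence number
        self.wait_set = []
        self.ready_set = []
        self.issued_set = []

    def dispatch(self, name, deps):
        """Place a dispatched instruction in the WaitSet or the ReadySet."""
        entry = {"name": name, "seq": self.next_seq, "deps": set(deps)}
        self.next_seq += 1
        (self.wait_set if entry["deps"] else self.ready_set).append(entry)

    def cycle(self, completed=()):
        """Advance one cycle; `completed` names instructions that finished."""
        done = set(completed)
        # Finished instructions leave the IssuedSet.
        self.issued_set = [e for e in self.issued_set if e["name"] not in done]
        # Promote instructions whose operands have all become ready.
        for entry in self.wait_set:
            entry["deps"] -= done
        self.ready_set += [e for e in self.wait_set if not e["deps"]]
        self.wait_set = [e for e in self.wait_set if e["deps"]]
        # Issue the oldest ready instructions first.
        self.ready_set.sort(key=lambda e: e["seq"])
        issued = self.ready_set[:self.issue_width]
        self.ready_set = self.ready_set[self.issue_width:]
        self.issued_set += issued
        return [e["name"] for e in issued]
```

       In this sketch, an older instruction that becomes ready in a given cy-
       cle is issued ahead of a younger instruction that was already waiting
       in the ReadySet.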

   Write-Back and Retire Stage
       Issued instructions are moved  from  the  ReadySet  to  the  IssuedSet.
       There,  instructions  wait  until  they reach the write-back stage.  At
       that point, they get removed from the queue and the retire control unit
       is notified.

       When an instruction is executed, the retire control unit flags it as
       "ready to retire."

       Instructions are retired in program order.  The register file is  noti-
       fied  of the retirement so that it can free the physical registers that
       were allocated for the instruction during the register renaming stage.
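
       In-order retirement after out-of-order execution can be illustrated
       with the following Python sketch.  The class RetireControlUnit and its
       methods are hypothetical names chosen for this example; the list re-
       turned by retire() stands in for the notification sent to the register
       file.

```python
from collections import OrderedDict


class RetireControlUnit:
    """Toy retire control unit (hypothetical; not llvm-mca's implementation)."""

    def __init__(self):
        # Maps instruction name -> "ready to retire" flag, in program order.
        self.in_flight = OrderedDict()

    def on_dispatch(self, name):
        self.in_flight[name] = False

    def on_executed(self, name):
        """Flag an instruction as ready to retire."""
        self.in_flight[name] = True

    def retire(self):
        """Retire from the head; stop at the first unexecuted instruction."""
        retired = []
        while self.in_flight:
            name, executed = next(iter(self.in_flight.items()))
            if not executed:
                break
            del self.in_flight[name]
            retired.append(name)   # the caller would free its registers here
        return retired
```

       An instruction that finishes early therefore stays in flight until ev-
       ery older instruction has also been executed.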

   Load/Store Unit and Memory Consistency Model
       To simulate the out-of-order, speculative execution of memory
       operations, llvm-mca uses a simulated load/store unit (LSUnit).

       Each load (or store) consumes an entry in the load  (or  store)  queue.
       Users  can specify flags -lqueue and -squeue to limit the number of en-
       tries in the load and store queues respectively.  The  queues  are  un-
       bounded by default.

       The  LSUnit implements a relaxed consistency model for memory loads and
       stores.  The rules are:

       1. A younger load is allowed to pass an older load only if there are no
          intervening stores or barriers between the two loads.

       2. A  younger  load is allowed to pass an older store provided that the
          load does not alias with the store.

       3. A younger store is not allowed to pass an older store.

       4. A younger store is not allowed to pass an older load.

       By default, the LSUnit optimistically assumes that loads do not alias
       store operations (-noalias=true).  Under this assumption, younger loads
       are always allowed to pass older stores.  Essentially, the LSUnit does
       not attempt to run any alias analysis to predict when loads and stores
       do not alias with each other.

       Note that, in the case of write-combining memory, rule 3 could be
       relaxed to allow reordering of non-aliasing store operations.  That
       being said, at the moment there is no way to further relax the memory
       model (-noalias is the only option).  In particular, there is no option
       to specify a different memory type (e.g., write-back, write-combining,
       write-through) and consequently to weaken or strengthen the memory
       model.

       Other limitations are:

       o The LSUnit does not know when store-to-load forwarding may occur.

       o The LSUnit does not know anything about cache  hierarchy  and  memory
         types.

       o The  LSUnit  does not know how to identify serializing operations and
         memory fences.

       The LSUnit does not attempt to predict if  a  load  or  store  hits  or
       misses  the L1 cache.  It only knows if an instruction "MayLoad" and/or
       "MayStore."  For loads, the scheduling model provides  an  "optimistic"
       load-to-use  latency (which usually matches the load-to-use latency for
       when there is a hit in the L1D).

       llvm-mca does not know about serializing operations or
       memory-barrier-like instructions.  The LSUnit conservatively assumes
       that an instruction which has both "MayLoad" and unmodeled side effects
       behaves like a "soft" load barrier.  That is, it serializes loads
       without forcing a flush of the load queue.  Similarly, instructions
       that "MayStore" and have unmodeled side effects are treated like store
       barriers.  A full memory barrier is a "MayLoad" and "MayStore"
       instruction with unmodeled side effects.  This is inaccurate, but it is
       the best that can be done at the moment with the information currently
       available in LLVM.
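
       The barrier heuristic above amounts to a small decision function.  The
       Python sketch below is illustrative only: classify_barrier and its re-
       turn values are hypothetical names, and the three boolean parameters
       stand in for the "MayLoad", "MayStore" and unmodeled-side-effects prop-
       erties of an instruction.

```python
def classify_barrier(may_load, may_store, has_unmodeled_side_effects):
    """Return 'full', 'load', 'store' or 'none' for an instruction."""
    if not has_unmodeled_side_effects:
        # Without unmodeled side effects the instruction is never a barrier.
        return "none"
    if may_load and may_store:
        return "full"
    if may_load:
        return "load"
    if may_store:
        return "store"
    return "none"
```

       For instance, an ordinary load (no unmodeled side effects) is classi-
       fied as 'none', whereas a "MayLoad" instruction with unmodeled side ef-
       fects is classified as 'load'.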

       A load/store barrier consumes one entry of the load/store queue.  A
       load/store barrier enforces ordering of loads/stores.  A younger load
       cannot pass a load barrier.  Also, a younger store cannot pass a store
       barrier.  A younger load has to wait for the memory/load barrier to
       execute.  A load/store barrier is "executed" when it becomes the oldest
       entry in the load/store queue(s).  That also means that, by
       construction, all of the older loads/stores have been executed.

       In conclusion, the full set of load/store consistency rules is:

       1. A store may not pass a previous store.

       2. A store may not pass a previous load (regardless of -noalias).

       3. A store has to wait until an older store barrier is fully executed.

       4. A load may pass a previous load.

       5. A load may not pass a previous store unless -noalias is set.

       6. A load has to wait until an older load barrier is fully executed.
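
       The six rules above can be summarized as a pairwise predicate.  The
       following Python sketch is an illustration, not the LSUnit's code: the
       function may_pass and the string labels are hypothetical, and the two
       combinations the rules do not mention (a store behind a load barrier, a
       load behind a store barrier) are conservatively treated as "may not
       pass", which is an assumption of this sketch.

```python
def may_pass(younger, older, noalias=True):
    """Return True if `younger` may execute before `older`.

    `younger` is 'load' or 'store'; `older` is 'load', 'store',
    'load_barrier' or 'store_barrier'.
    """
    if younger == "store":
        # Rules 1-3: a store never passes an older store, load or store
        # barrier.  An older load barrier is not covered by the rules and
        # is treated the same way (assumption of this sketch).
        return False
    if younger == "load":
        if older == "load":
            return True        # rule 4
        if older == "store":
            return noalias     # rule 5
        if older == "load_barrier":
            return False       # rule 6
        # An older store barrier is not covered by the rules (assumption).
        return False
    raise ValueError("younger must be 'load' or 'store'")
```

       The predicate looks only at one pair of operations at a time; it does
       not model queue occupancy or the point at which a barrier becomes the
       oldest entry in its queue.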

AUTHOR
       Maintained by the LLVM Team (https://llvm.org/).

COPYRIGHT
       2003-2020, LLVM Project

8                                 2020-03-19                       LLVM-MCA(1)
