Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.0k views
in Technique[技术] by (71.8m points)

assembly - Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock?

I assume simple spinlock that does not go to OS waiting for the purposes of this question.

I see that simple spinlock is often implemented using lock xchg or lock bts instead of lock cmpxchg.

But doesn't cmpxchg avoid writing the value if the expectation does not match? So aren't failed attempts cheaper with cmpxchg?

Or does cmpxchg write data and invalidate cache line of other cores even on failure?

This question is similar to What specifically marks an x86 cache line as dirty - any write, or is an explicit change required?, but it is specific to cmpxchg, not in general.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

On most or all current Intel x86 processors, a lock cmpxchg to a location whose memory type is WB and is fully contained within a single L1D cache line is executed as follows:

  • A lock-read request is issued to the L1D, which brings the target line in a locked-exclusive cache coherence state and provides the requested bytes as input to one of the execution ports to perform the comparison. (Cache locking is supported since the P6.) A line in a locked state cannot be invalidated or evicted for any reason.
  • Perform the comparison for equality.
  • Whatever the result is, issue a an unlock-write request to the L1D, which changes the state of the cache line to Modified and unlocks the line, thereby allowing other access or coherence requests to replace or invalidate the line.

The first and last steps can be observed empirically using either certain performance events or latency-based measurements. One way would be to allocate a large array of atomic variables and then execute lock cmpxchg in a loop over that array. The lock-read request type is one of the types of RFO requests. So the L2_TRANS.RFO event (or what's equivalent), which is reliable on most microarchitectures, can be used to measure the number of lock-reads to the L2. (L2_TRANS.RFO counts demand RFOs, so it's better to turn off the hardware prefetchers to avoid unwanted hits in the L2. This also applies to L2_RQSTS.RFO_*.)

There are also events for measuring the number of writebacks, such as L2_TRANS.L1D_WB, L2_TRANS.L2_WB, and others. Unfortunately, many of these events and across many microarchtiectures either undercount, overcount, or they count accurately but not necessarily all/only dirty cache line writebacks. So they are more difficult to reason with and in general not reliable.

A better way would be to execute lock cmpxchg on one section of the array on a particular physical core, then migrate the thread to another physical core (in the same L3 sharing domain) and execute a loop in which the elements of that section are read (normal reads). If the lock cmpxchg instruction puts the target line in the M state, a read request from another physical core in the same L3 sharing domain should hit in the L3 and also hit-modified in the private caches of the core on which lock cmpxchg was executed. These events can be counted using OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE (or what's equivalent), which is reliable on most/all microarchitectures.

A locked instruction is an expensive operation for three reasons: (1) Requires bringing the line in an exclusive state, (2) Makes the line dirty (possibly unnecessarily) and too many writebacks can have a significant impact on execution time, even more so when they end up stealing main memory bandwidth from long stretches of read requests, and even more so when the writes are to persistent memory, and (3) They are architecturally serializing, which makes the instruction on the critical path.

Intel has a patent that proposes an optimization for the last one, where the core optimistically assumes that there is no lock contention and issues a speculative normal load to the target line. If the line is not present in any other physical core, the line will be in an exclusive state in the requesting core. Then when the locked instruction executes and issues the lock-read request, the line would hopefully still be in the exclusive state, in which case the total latency of the locked instruction would be reduced. I don't know if any processor implements this optimization. If it's implemented, the number of L2_TRANS.RFO events would be much smaller than the number of lines locked.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
...