Add API routine to run code in every thread for updating global flags
On AArchXX architectures, reading a global flag stored in memory from each thread requires a load-acquire operation, which is more expensive than a plain load. For rarely changed flags, I'd like each thread to use a regular load, with an expensive mechanism that performs the load-acquire in every thread at once when the flag changes. Even better would be for each thread to read a thread-private copy of the flag, with a way to make every thread reload the global value into its private copy when the global changes.
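To make the pattern concrete, here is a minimal sketch in portable C11 of the private-copy scheme described above. Every name in it (`global_mode`, `private_mode`, `read_mode_fast`, `refresh_private_mode`, `publish_mode`) is hypothetical and chosen for illustration only; none of them is an existing DynamoRIO API, and real instrumentation would keep the private copy in a DynamoRIO thread-local slot rather than in a C `_Thread_local` variable.

```c
#include <stdatomic.h>

/* Hypothetical names throughout; none of these are DynamoRIO APIs. */

/* Global mode flag, written rarely by a controlling thread. */
static _Atomic int global_mode;

/* Thread-private copy that the hot path reads with a plain load. */
static _Thread_local int private_mode;

/* Hot path (e.g. the top of every block): no acquire semantics needed. */
static inline int
read_mode_fast(void)
{
    return private_mode;
}

/* The callback each thread would run when the global changes:
 * one load-acquire from the global, then a store to the private copy.
 */
static void
refresh_private_mode(void)
{
    private_mode = atomic_load_explicit(&global_mode, memory_order_acquire);
}

/* Writer side: publish the new value with a store-release.
 * The routine proposed in this issue would then be called to make
 * every thread execute refresh_private_mode().
 */
static void
publish_mode(int new_mode)
{
    atomic_store_explicit(&global_mode, new_mode, memory_order_release);
}
```

Until a thread runs `refresh_private_mode()`, `read_mode_fast()` in that thread keeps returning the old value, which is why a way to run code in every thread is needed.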
My proposal is a variant of dr_suspend_all_other_threads_ex() that takes a callback and invokes it in every thread. In many use cases it would not need to wait for all threads to reach a synch point: it would just send each thread a signal and invoke the callback from the signal handler. The callback therefore has to be safe to run from the handler, but I would expect it to normally just do a load-acquire from the global and then store into the private slot. If a synch point is needed, we could instead have a callback that's invoked from the suspension point of dr_suspend_all_other_threads_ex().
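As a rough illustration of the signal-based variant, the following standalone sketch uses plain POSIX threads and signals rather than DynamoRIO internals. It assumes a POSIX system with `SIGUSR1` free for use, and all names (`shared_mode`, `local_mode`, `on_refresh_signal`, `demo_signal_refresh`, and so on) are hypothetical. It is not how DynamoRIO would implement the routine; it only shows a handler doing the load-acquire and the store to a thread-private slot.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

/* All names here are hypothetical; this is not DynamoRIO code. */
static _Atomic int shared_mode;                         /* global flag */
static _Thread_local volatile sig_atomic_t local_mode;  /* per-thread copy */
static _Atomic int observed_by_worker;                  /* demo result */
static _Atomic int give_up;                             /* demo bail-out */

/* Signal handler standing in for the per-thread callback: a load-acquire
 * from the global followed by a store into the thread-private slot.
 */
static void
on_refresh_signal(int sig)
{
    (void)sig;
    local_mode = atomic_load_explicit(&shared_mode, memory_order_acquire);
}

/* Worker: polls only its private copy until the handler updates it. */
static void *
worker_main(void *arg)
{
    (void)arg;
    while (local_mode == 0 && !atomic_load(&give_up))
        sched_yield();
    atomic_store(&observed_by_worker, (int)local_mode);
    return NULL;
}

/* Publishes new_mode (must be nonzero), signals one worker thread, and
 * returns the value that worker saw in its private copy, or -1 on error.
 */
static int
demo_signal_refresh(int new_mode)
{
    struct sigaction act;
    pthread_t worker;

    memset(&act, 0, sizeof(act));
    act.sa_handler = on_refresh_signal;
    sigemptyset(&act.sa_mask);
    if (sigaction(SIGUSR1, &act, NULL) != 0)
        return -1;
    if (pthread_create(&worker, NULL, worker_main, NULL) != 0)
        return -1;

    atomic_store_explicit(&shared_mode, new_mode, memory_order_release);
    if (pthread_kill(worker, SIGUSR1) != 0)
        atomic_store(&give_up, 1); /* let the worker exit so join returns */
    pthread_join(worker, NULL);
    return atomic_load(&give_up) ? -1 : atomic_load(&observed_by_worker);
}
```

A real implementation would signal every other thread and would have to deal with threads that block or mask the signal, which this sketch ignores.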
My use case is #3995, where I want to change my drbbdup (#4134) case value to swap between instrumentation modes. The idea is to avoid the cost of reading the global flag holding the mode at the top of every single block.