Mm.. what normally happens is that a graphics card will show peaks at 100%, but if you sampled at a finer resolution, the graph would actually sit anywhere between 60 and 100%.
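If you want to see that for yourself, here's a minimal sketch using NVML (assuming you have the NVML header around and can link against -lnvidia-ml; the driver still averages over its own sample window, but a 100 ms poll shows a lot more of the variation than a once-per-second graph):

```cpp
// Minimal sketch: poll GPU utilization every 100 ms so the 60-100%
// swings show up instead of being averaged into a flat "100%" line.
// Assumption: NVML is installed; build with  g++ poll.cpp -lnvidia-ml
#include <stdio.h>
#include <unistd.h>
#include <nvml.h>

int main(void) {
    nvmlDevice_t dev;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) return 1;

    for (int i = 0; i < 100; ++i) {            // roughly 10 seconds of samples
        nvmlUtilization_t util;
        if (nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS)
            printf("%3d: gpu %u%%  mem %u%%\n", i, util.gpu, util.memory);
        usleep(100 * 1000);                    // 100 ms between samples
    }
    nvmlShutdown();
    return 0;
}
```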
How this really happens is that the graphics card receives a short command queue (under normal circumstances, two commands at a time :p more than four and you get into trouble). So you get a preparation phase and an execution phase, something like that, which means there's always downtime on the GPU between context switches. Meanwhile, a lot of operations don't reserve all the shaders at once, so what the architecture does is exploit that fact and essentially fold the chip in two (fold it again and things croak). That way you can use all the shaders for two separate, concurrently submitted shader operations.
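Here's a rough sketch of that "fold the chip in two" idea using CUDA streams: two dummy kernels that each only ask for about half the SMs, so neither one reserves the whole chip and the two can run side by side (the busy() kernel and the sizes are made up for illustration, not taken from any real game):

```cpp
// Sketch: two half-sized workloads submitted concurrently in separate streams.
// Assumption: a device with concurrent-kernel support (Kepler or newer).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy(float *out, int iters) {
    float v = threadIdx.x * 0.001f;
    for (int i = 0; i < iters; ++i)            // burn some cycles
        v = v * 1.000001f + 0.0001f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Give each kernel only half the SMs' worth of blocks.
    int blocksEach = prop.multiProcessorCount / 2;
    if (blocksEach < 1) blocksEach = 1;

    float *a, *b;
    cudaMalloc(&a, blocksEach * 256 * sizeof(float));
    cudaMalloc(&b, blocksEach * 256 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Two separate, concurrently submitted shader-style operations.
    busy<<<blocksEach, 256, 0, s1>>>(a, 1 << 20);
    busy<<<blocksEach, 256, 0, s2>>>(b, 1 << 20);
    cudaDeviceSynchronize();

    printf("SMs: %d, blocks per kernel: %d\n",
           prop.multiProcessorCount, blocksEach);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```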
The fact that practically all games distribute their load like this (in bursts; fps drops generally come from the peak load games generate at random moments, etc... a very dumb way to make games) is, on the one hand, the reason for Nvidia's entire Kepler-to-Pascal line with its boost clocks, since it lets them boost the card during the actual loads and generally increase performance without pushing temperatures to the level of a proper overclock. On the other hand it explains the reduced number of pipes and individual shader-units/SMXes (read: a bad CPU at very low clock speeds with a limited SIMD instruction set that costs approximately a rice-corn to produce. If you can put fifteen of these on the same chip for 4 rice-corns instead of 15, and then make the whole module 15 times smaller, of course you're going to do that, since unifying more elements on one chip cuts production costs by a great deal, or at least reduces the Foxconn mounting costs).
In short, all the later Nvidia cards have fewer individual shader-units, collected into what they call "SMX"es, and they use simple internal scheduling to split the types of operations that are submitted concurrently, so they can be executed linearly with a larger number of allocated shader-units than usual, while the clock speed of the card is boosted during the actual execution phase.
So what happens with NMS is that it screws everything up and allocates all the shader-units it can, and then expects those units to be available for concurrent calls during runtime. This is.. probably some of the per-frame fetches, some of the shadow effects maybe, a few other things. Parts of the Havok library perhaps.
And now the rendering context and the additional subroutines have mapped all the "theoretically" available shader-units on your folded Kepler chip. But still, so far so good, because the scheduler can handle this. Until the engine submits a command that requires a context switch, and there's a hiccup.
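To make the hiccup a bit more concrete, here's a sketch of the "one workload grabs everything" situation: a hog kernel oversubscribes every SM, and a tiny "per-frame" kernel in a second stream just has to wait for resources to drain (the kernels and sizes are stand-ins I made up, not anything from the actual NMS renderer):

```cpp
// Sketch: one launch owns every shader-unit, a tiny concurrent job stalls.
// Assumption: same dummy busy() kernel as in the earlier sketch.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy(float *out, int iters) {
    float v = threadIdx.x * 0.001f;
    for (int i = 0; i < iters; ++i)
        v = v * 1.000001f + 0.0001f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Oversubscribe: far more blocks than the chip has SMs, so the
    // "compute" launch hogs every shader-unit for a while.
    int hogBlocks = prop.multiProcessorCount * 32;

    float *hog, *frame;
    cudaMalloc(&hog, hogBlocks * 256 * sizeof(float));
    cudaMalloc(&frame, 256 * sizeof(float));

    cudaStream_t sHog, sFrame;
    cudaStreamCreate(&sHog);
    cudaStreamCreate(&sFrame);

    cudaEvent_t start, done;
    cudaEventCreate(&start);
    cudaEventCreate(&done);

    cudaEventRecord(start, sFrame);
    busy<<<hogBlocks, 256, 0, sHog>>>(hog, 1 << 22);   // the greedy allocation
    busy<<<1, 256, 0, sFrame>>>(frame, 1 << 10);       // the tiny per-frame job
    cudaEventRecord(done, sFrame);
    cudaEventSynchronize(done);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, done);
    // Even though the per-frame job is tiny, it can sit waiting for
    // blocks of the hog kernel to drain before it gets any SMs at all.
    printf("per-frame kernel took %.2f ms wall time\n", ms);

    cudaFree(hog);
    cudaFree(frame);
    return 0;
}
```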
Since the scheduler on these cards isn't designed to prioritize which thread is needed to complete the next frame, for example (it's a simple and completely dumb linear system, only slightly more advanced than the scheduler in Windows - read: a lazy janitor who picks up whatever he sees, then idles for a bit), this context-switch slowdown can sometimes be exaggerated a lot. There are compute examples where you can create unsolvable race conditions with just three commands being resubmitted over and over.. you know. Every cycle there's a 2/3 chance that the wrong command gets picked, and the wrong command is exactly what the scheduler prefers, because picking it avoids a context switch, which is extremely expensive resource-wise. With only a 1-in-3 chance per cycle of getting it right, the command you actually need waits three cycles on average, and the unlucky streaks can run much longer. So unfortunate things are going to happen sooner or later, no matter how lucky you are.
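Just to put a number on that 2/3 claim, here's a toy host-side simulation of a scheduler that keeps dodging the one command that needs the context switch (the model is my own simplification, nothing documented by Nvidia):

```cpp
// Toy simulation: 3 pending commands, a 1-in-3 chance per cycle that the
// scheduler picks the one that actually needs the context switch.
#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> pick(0, 2);   // 3 pending commands

    const int trials = 100000;
    long total = 0;
    int worst = 0;
    for (int t = 0; t < trials; ++t) {
        int cycles = 1;
        // Keep resubmitting until command 0 (the context switch) gets picked.
        while (pick(rng) != 0)
            ++cycles;
        total += cycles;
        if (cycles > worst) worst = cycles;
    }
    printf("average wait: %.2f cycles, worst case: %d cycles\n",
           (double)total / trials, worst);
    return 0;
}
```

With these assumptions the average wait comes out around three cycles, but the worst cases run into dozens, and those rare long streaks are where the visible stutter would come from.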
So there are some potential pitfalls with the use of compute-type operations on all GTX Nvidia cards since Kepler. In practice this shows up as fully occupied SMX units, but it often comes along with race conditions that keep the calculations from completing in time to stop the framerate from dropping, at least once in a while.