OpenCV has quite a few algorithms (optical flow, trackers) that might be able to estimate the shifts you are describing in your video. Some of them are also reasonably fast.
So one could just do the capture with a mostly free-running camera, store the resulting images on disk, and align them (calculate your “L”) in a later processing step.
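To make the "align later" idea a bit more concrete, here is a minimal sketch of estimating the vertical shift between two stored captures. It is not one of the OpenCV algorithms mentioned above; it is a deliberately simple brute-force stand-in in plain Python that compares the mean brightness of each image row. The function names, the `max_shift` and `min_overlap` parameters, and the assumption that the film only moves vertically are all mine, so treat it as an illustration of the principle rather than a finished tool.

```python
def row_profile(image):
    """Mean brightness of every row; image is a list of rows of numbers."""
    return [sum(row) / len(row) for row in image]


def estimate_vertical_shift(prev_image, next_image, max_shift, min_overlap=8):
    """Return s such that row r of next_image best matches row r + s of prev_image.

    Brute-force search over -max_shift..max_shift, scoring each candidate by
    the mean absolute difference of the overlapping row profiles.
    """
    prev = row_profile(prev_image)
    nxt = row_profile(next_image)
    best_shift, best_cost = 0, float("inf")
    for shift in range(-max_shift, max_shift + 1):
        # Pair up only the rows that exist in both captures for this shift.
        pairs = [(prev[r + shift], nxt[r])
                 for r in range(len(nxt))
                 if 0 <= r + shift < len(prev)]
        if len(pairs) < min_overlap:
            continue  # too little overlap to trust this candidate
        cost = sum(abs(a - b) for a, b in pairs) / len(pairs)
        if cost < best_cost:
            best_shift, best_cost = shift, cost
    return best_shift


# Synthetic example: two 40-row windows cut from one 60-row strip, 12 rows apart.
strip = [[(r * 37) % 101] * 8 for r in range(60)]
first = strip[0:40]
second = strip[12:52]
print(estimate_vertical_shift(first, second, max_shift=20))  # 12
```

On real scans one would more likely hand the two images to an OpenCV routine, which also copes with noise and sub-pixel shifts; the sketch only shows why the captures have to share some content for the search to find a match.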
One needs a sufficiently large overlap between neighbouring captures, slightly more disk space (because of that overlap), and some additional processing time (in my experience, anything between 300 ms and 1 s per frame).
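For a rough feel of what that per-frame cost adds up to, a small back-of-the-envelope helper can be useful. The 0.3 s and 1 s bounds are the figures quoted above; the function name and the frame count of 10,000 are purely hypothetical and only there for illustration.

```python
def alignment_time_range(frame_count, min_s_per_frame=0.3, max_s_per_frame=1.0):
    """Return (low, high) total alignment time in seconds for frame_count frames."""
    return frame_count * min_s_per_frame, frame_count * max_s_per_frame


# Hypothetical reel of 10,000 captures.
low, high = alignment_time_range(10_000)
print(f"roughly {low / 60:.0f} to {high / 60:.0f} minutes of post-processing")
```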