The electronics will be less daunting than the mechanics & optics.
At 18 inch/s, a 10 kHz signal has a 1.8 mil wavelength, so the light must be under 1 mil = 0.025 mm along the audio track and maybe 2.5 mm across. Some poking around suggests 10 kHz is a bit over the actual bandwidth, but it’s a reasonable starting point.
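As a sanity check on those numbers, here is a minimal Python sketch of the wavelength arithmetic. It also estimates how much a slit of a given width attenuates a 10 kHz tone, assuming an ideal rectangular slit with perfectly uniform illumination (the usual sin(x)/x aperture response). The function names are mine, and a real slit will do somewhat worse.

```python
import math

MM_PER_MIL = 0.0254  # 1 mil = 0.001 inch = 0.0254 mm


def wavelength_mil(film_speed_in_per_s, freq_hz):
    """Length of one recorded cycle along the film, in mils."""
    return film_speed_in_per_s / freq_hz * 1000.0


def slit_loss_db(slit_mil, wave_mil):
    """Response of an ideal rectangular slit, in dB relative to a zero-width slit.

    Assumption: uniform illumination across the slit, so the response
    is |sin(x) / x| with x = pi * slit_width / wavelength.
    """
    x = math.pi * slit_mil / wave_mil
    if x == 0:
        return 0.0
    response = abs(math.sin(x) / x)
    return 20 * math.log10(response) if response > 0 else float("-inf")


wave = wavelength_mil(18.0, 10_000.0)
print(f"wavelength: {wave:.2f} mil = {wave * MM_PER_MIL:.4f} mm")
for slit in (0.5, 1.0, 1.5):
    loss = slit_loss_db(slit, wave)
    print(f"{slit:.1f} mil slit ({slit * MM_PER_MIL:.4f} mm): {loss:.1f} dB at 10 kHz")
```

Under that idealization, a 1 mil slit is already roughly 5 dB down at 10 kHz, which is why "under 1 mil" is the target rather than "about 1.8 mil".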
The illuminator must produce a sheet of light collimated well enough to account for film bendiness / wobbulation in the gate. Basically, you want a rectangle of light passing vertically through the film.
On the other side of the film, a magnifier images the (now modulated) light onto a photodiode detector or, for stereo, two detectors. The photodiodes need enough illumination (more bright light!) to produce a good SNR and a current-to-voltage amplifier to extract it. After that, it’s a reasonably low-bandwidth audio preamp.
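To put rough numbers on "enough illumination", here is a small Python sketch of the best case: the output of an ideal current-to-voltage (transimpedance) stage and the SNR you get if photon shot noise is the only noise source. The 1 µA photocurrent, 1 MΩ feedback resistor, and 10 kHz bandwidth are made-up illustration values, not measurements, and a real amplifier adds resistor and op-amp noise on top of this.

```python
import math

ELECTRON_CHARGE = 1.602176634e-19  # coulombs


def tia_output_volts(photocurrent_a, feedback_ohms):
    """Output magnitude of an ideal transimpedance amplifier: V = I * R."""
    return photocurrent_a * feedback_ohms


def shot_noise_snr_db(photocurrent_a, bandwidth_hz):
    """Upper bound on SNR when only photocurrent shot noise is present.

    RMS shot-noise current is sqrt(2 * q * I * B); amplifier and
    thermal noise are deliberately ignored in this sketch.
    """
    noise_a = math.sqrt(2 * ELECTRON_CHARGE * photocurrent_a * bandwidth_hz)
    return 20 * math.log10(photocurrent_a / noise_a)


# Hypothetical operating point, chosen only for illustration.
current = 1e-6      # 1 uA of photocurrent
bandwidth = 10e3    # 10 kHz audio bandwidth
print(f"TIA output: {tia_output_volts(current, 1e6):.2f} V")
print(f"shot-noise-limited SNR: {shot_noise_snr_db(current, bandwidth):.1f} dB")
```

The useful takeaway is the scaling: shot-noise-limited SNR improves by only 10 dB for every tenfold increase in photocurrent, so dim light costs you quickly.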
I’d start with photo-etched slit masks on both sides of the film and cylindrical lenses for collimation, although I think the right way has real lenses imaging aperture slits on the film to maintain enough clearance. I don’t know enough optics to have a real opinion.
The pitch of the audio signal will vary with the (nearly constant) film speed, as will the frequency of the encoder signal from the capstan drive. Given that the encoder signal should be a constant frequency, some DSP magic can determine the frequency error and the adjustment required to correct it, then apply the same tweak to the audio track signal.
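One way that "DSP magic" might look, as a rough Python sketch rather than a tested design: treat each encoder edge as a known increment of film travel, work out when each nominal-speed film position actually passed the slit, and read the audio at that instant. The assumptions here are mine: the encoder edges are timestamped on the same clock as the audio samples, the first audio sample is at time zero, and linear interpolation is good enough (a real implementation would want a proper band-limited resampler).

```python
def speed_ratio(edge_times, nominal_edge_hz):
    """Average film speed relative to nominal, from encoder edge timestamps."""
    measured_hz = (len(edge_times) - 1) / (edge_times[-1] - edge_times[0])
    return measured_hz / nominal_edge_hz


def resample_to_nominal_speed(audio, fs, edge_times, nominal_edge_hz):
    """Resample audio as if the film had run at exactly nominal speed.

    audio           samples taken at a uniform rate fs (Hz), starting at t = 0
    edge_times      increasing encoder edge timestamps (s), same clock as audio
    nominal_edge_hz encoder edge rate at the correct film speed
    """
    n_edges = len(edge_times)
    if n_edges < 2:
        raise ValueError("need at least two encoder edges")
    n_out = int((n_edges - 1) / nominal_edge_hz * fs)
    out = []
    for i in range(n_out):
        # Film position for output sample i, in units of encoder edges.
        pos = i / fs * nominal_edge_hz
        k = min(int(pos), n_edges - 2)
        frac = pos - k
        # Real time at which that film position passed the slit.
        t = edge_times[k] + frac * (edge_times[k + 1] - edge_times[k])
        # Linear interpolation between the two nearest audio samples.
        s = t * fs
        j = int(s)
        if j + 1 >= len(audio):
            break
        out.append(audio[j] + (s - j) * (audio[j + 1] - audio[j]))
    return out
```

For example, if the film runs 2% fast, a 1 kHz tone on the track comes off the photodiode at 1020 Hz and a nominally 1000 edge/s encoder reports 1020 edge/s; `speed_ratio` returns 1.02 and the resampled output puts the tone back at 1 kHz. Because the correction works edge by edge, it also follows slow speed wobble, not just a constant offset.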
All of which is easier in theory than in practice … [grin]