Thank you for your samples!
@dan74: Whew, your sample really is a worst-case scenario.
There are several technical aspects that make it hard to get a “great” result here.
First of all, there is virtually no overlap between neighbouring frames, so there are gaps in the audio. Neither AEO-Light nor my software is able (at this point) to fill gaps in the audio signal.
This causes the stuttering artefacts in the audio of the sample.
That is a new point to consider for the future: an algorithm that interpolates gaps in the audio.
The next problem is the low resolution of the images: they are only 480p.
So (without overlap) there are only 480 audio samples per frame. This results in a sampling rate of 11520 Hz (at 24 fps), and according to the sampling theorem the audio signal can then only have a bandwidth of 0 Hz to 5760 Hz.
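To make the arithmetic easy to repeat for other scans, here is a minimal Python sketch. It assumes that every image row contributes exactly one audio sample and that rows shared with the neighbouring frame do not add new samples. The function names are made up for this example and are not taken from AEO-Light or any other tool.

```python
def soundtrack_sample_rate(rows_per_frame, fps, overlap_rows=0):
    """Audio sampling rate in Hz, assuming one audio sample per image row.

    Rows that overlap with the neighbouring frame are counted only once.
    """
    unique_rows = rows_per_frame - overlap_rows
    return unique_rows * fps


def nyquist_bandwidth(sample_rate):
    """Highest audio frequency (Hz) representable at the given sampling rate."""
    return sample_rate / 2.0


# The 480p sample discussed above, at 24 fps and without overlap.
rate = soundtrack_sample_rate(480, 24)
print(rate, nyquist_bandwidth(rate))  # 11520 5760.0
```

The same two functions can be used to check how many rows a scan needs before the soundtrack is no longer limited by the image height.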
The next problem is the heavy JPEG compression. The audio signal is stored in the transitions between the brighter and darker areas of the soundtrack. JPEG compression is DCT-based: the image is transformed to the frequency domain and quantised there to reduce the amount of data.
This also affects the transitions and the flat areas, in the form of ringing and blocking. So while noise in the flat areas may be reduced, new artefacts are introduced into the audio signal.
Then, there seem to be line artefacts and some kind of ghosting artefacts, maybe caused by a suboptimal CCD sensor.
Uneven background illumination and so on…
Some of these artefacts could be reduced, but it would be much better to avoid them.
But generally, worst-case scenarios are absolutely welcome to push the limits of the software.
That brings me to some theoretical aspects that should be considered when digitising optical soundtracks in order to get a better result.
The green frame shows an optimal use of the image area. It is the best compromise between audio overlap and movie picture resolution.
Since my software uses the image information to determine the ideal overlap, it is good to have some characteristic details, such as sprocket holes, in the image.
This is superior to stitching after the audio analysis, because it avoids misinterpreting periodic audio signals. In addition, the spacing between the two overlap positions can be interpreted as information about the shrinkage of the film.
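The idea of image-based overlap detection can be sketched roughly as follows. This is only an illustration under simple assumptions (greyscale frames of equal size, pure vertical displacement, a plain mean-absolute-difference search), not the actual method of my software or of AEO-Light. The function names and the nominal frame pitch are hypothetical.

```python
import numpy as np


def find_overlap(prev_frame, next_frame, min_overlap=4, max_overlap=None):
    """Number of rows at the bottom of prev_frame that reappear at the top of next_frame.

    Tries every candidate overlap and keeps the one with the smallest
    mean absolute difference between the two image strips.
    """
    height = prev_frame.shape[0]
    if max_overlap is None:
        max_overlap = height // 2
    best_overlap, best_cost = min_overlap, np.inf
    for overlap in range(min_overlap, max_overlap + 1):
        tail = prev_frame[height - overlap:].astype(np.float64)
        head = next_frame[:overlap].astype(np.float64)
        cost = np.mean(np.abs(tail - head))
        if cost < best_cost:
            best_overlap, best_cost = overlap, cost
    return best_overlap


def shrinkage_estimate(measured_pitch_px, nominal_pitch_px):
    """Relative shrinkage, assuming the nominal frame pitch in pixels is known."""
    return 1.0 - measured_pitch_px / nominal_pitch_px


# Synthetic test strip: two 200-row "frames" cut from it share 40 rows.
rng = np.random.default_rng(0)
strip = rng.random((360, 32))
prev_frame, next_frame = strip[:200], strip[160:360]

overlap = find_overlap(prev_frame, next_frame)
pitch = prev_frame.shape[0] - overlap          # frame pitch in pixels
print(overlap, pitch, shrinkage_estimate(pitch, 162.0))  # 162.0 is a made-up nominal pitch
```

On real scans the match will never be exact, which is why distinctive image content such as sprocket holes helps: it gives the search a clear minimum even when the soundtrack itself is periodic.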
In fact, a correct scan at 2K resolution would be absolutely sufficient to cover the full bandwidth of the audio signal. However, 4K would be a better compromise with regard to the movie picture resolution.
Because of the area needed for the audio overlap, the movie picture resolution would fall below HD resolution if a 2K sensor is used.
It would be best to use a monochrome image sensor, because it has no Bayer pattern and, as a result, a better fill factor and better anti-aliasing.
For variable-density soundtracks a higher bit depth is needed, because the audio information is stored in the luminance. A smaller bit depth therefore means a coarser quantisation of the audio signal.
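The effect of bit depth on a variable-density track can be illustrated with a small Python sketch. It assumes an idealised track whose luminance is a full-scale sine between 0 and 1 and a uniform quantiser. The names and the 997 Hz test tone are chosen only for this example.

```python
import numpy as np


def quantise(signal, bits):
    """Uniformly quantise a signal in the range [0, 1] to 2**bits levels."""
    steps = 2 ** bits - 1
    return np.round(signal * steps) / steps


def snr_db(clean, degraded):
    """Signal-to-noise ratio in dB, ignoring the constant (DC) part of the signal."""
    noise = degraded - clean
    signal_power = np.sum((clean - clean.mean()) ** 2)
    return 10.0 * np.log10(signal_power / np.sum(noise ** 2))


# One second of an idealised variable-density track: luminance between 0 and 1.
t = np.arange(48000) / 48000.0
luminance = 0.5 + 0.5 * np.sin(2 * np.pi * 997.0 * t)

snr_8bit = snr_db(luminance, quantise(luminance, 8))
snr_12bit = snr_db(luminance, quantise(luminance, 12))
print(round(snr_8bit, 1), round(snr_12bit, 1))
```

Each additional bit improves the signal-to-noise ratio by roughly 6 dB, so going from 8 to 12 bits gains roughly 24 dB in this idealised case. A real track rarely uses the full luminance range, so the usable precision is lower than these figures suggest.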
An uneven background illumination causes hum in the audio signal, so the illumination should be as even as possible.
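Why uneven illumination turns into hum can be shown with a short Python sketch. The assumptions: one audio sample per image row, 480 rows per frame, 24 fps, and a made-up illumination profile that is 10 % darker at the bottom of every frame. Because the same profile repeats in every frame, it adds a component at the frame rate (and its harmonics) to an otherwise silent track.

```python
import numpy as np

FPS = 24
ROWS = 480                      # rows per frame, one audio sample per row
SAMPLE_RATE = FPS * ROWS        # 11520 Hz


def apply_illumination(audio, profile):
    """Multiply every frame-long chunk of the track by the vertical illumination profile."""
    frames = audio.reshape(-1, profile.size)
    return (frames * profile).ravel()


def dominant_frequency(signal, sample_rate):
    """Frequency (Hz) of the strongest spectral component, ignoring the DC offset."""
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]


silence = np.full(FPS * ROWS, 0.5)          # one second of unmodulated track
profile = np.linspace(1.0, 0.9, ROWS)       # hypothetical: 10 % darker at the bottom
hummed = apply_illumination(silence, profile)

print(dominant_frequency(hummed, SAMPLE_RATE))

# If the profile is known, dividing by it (a flat-field correction) removes the hum again.
corrected = (hummed.reshape(-1, ROWS) / profile).ravel()
```

In practice the profile is not known exactly and has to be estimated, which is one more reason to get the illumination even at scan time.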
The sensor used for digitising should not have fixed-pattern noise, and the optics should be clean and free of dust (even dust causes an audible periodic noise).
Some of the resulting artefacts can be reduced or completely removed, but it is still better to avoid them.