A randomly selected image this morning - my old VW Eos | |

Camera: Panasonic DMC-GX7 | Date: 30-01-2015 17:00 | Resolution: 4894 x 3059 | ISO: 320 | Exp. bias: -66/100 EV | Exp. Time: 1/60s | Aperture: 5.0 | Focal Length: 17.0mm | Lens: LUMIX G VARIO PZ 14-42/F3.5-5.6 |

My programming project of the last few weeks has been to build my own “rolling portfolio”, which shows random images from my photographic portfolio as either a screensaver or a rolling display on a second monitor. I’ve implemented a number of features I’ve always wanted but never had from freeware/shareware options, like precise control over timing, the ability to quickly add a note if I see a required correction, and the ability to locate and review recent images if someone says “what was that picture you were just showing?”.

Having previous blogged about the poor quality of “random” algorithms in Android music player apps (seeĀ How Hard Can It Possibly Be?), I decided to put my money where my mouth is, and write my own preferred random algorithm. This does a recursive, random walk down the selected folder tree, until it either finds an image file, or a dead end (and then tries again). This was refreshingly easy to implement, and as expected runs quickly without needing any prior indexing of the content.

Also as expected, the simplest implementation returned a disproportionate number of hits (and therefore a lot of repeats) from folders with a very small number of images, but that was easily fixed by adding a “weighting” at the second stage of the walk, to reduce the number of hits on smaller portfolios.

Job done? Maybe. I started to notice that I still see the same image selected twice in quick succession, and sometimes more than twice over a day or two. At first I thought this might be an issue with seeding the random number generator, so that I was re-generating the same random sequences, but a quick check confirmed that wasn’t the problem. The next most obvious possibility (to me!) was an issue with the Microsoft .Net random() function, so I added some logging to the app, recording each random number, and then fed a day’s worth through some frequency analysis in Excel. That got Microsoft off the hook with a clean bill of health: there’s a slight preponderance of zeros, which I can explain, but otherwise the spread of results looks fine.

At the same time, I also added logging for the selected images themselves. In yesterday’s work hours operation the screen saver showed 335 images, of which no fewer than 21 were duplicates. Given that I have over 3500 images in the portfolio, this seems very high, but maybe not.

This is a known problem in mathematics, a generalisation of the “birthday problem”. It’s so known, because a common formulation is the question “given a room of people, what is the probability that at least two have the same birthday?”. While you need at 367 people to *guarantee *a duplicate, the counter-intuitive result is that with just 23 people in the room, it’s more likely than not. The generalised equation for the solution is the following:

*E = k – n + n(1 – 1/n) ^{k}*

In this n is the number of items, k is the number of random selections, and E is the expected number of duplicates. Feed in k = 335 and n = 3500, and you get the outcome E = 16. That’s close enough to my observed value of 21 (this is all random, so any one measurement might be either side of the expected value, but the order of magnitude is right). Couple this with the way my mind works, looking for patterns, and I must therefore expect to see some repetition. However it’s clear that the algorithm is working fine, it’s just the normal workings of probability.

Another implication of this is that as the sample grows, some images will naturally appear several times, and others may not appear at all. If we take 3500 samples, the expected number of duplicates rises to over 1200, so over 1/3 of the images will still be unselected.

Do I fix this? The relatively simple resolution is to keep a list of selected images, and use that to discard any selections which are repeats during a given period. However I would rather run this without a data store and maybe, now I can explain it, I’m comfortable. Time will tell.