


Monday, 25 August 2025

Masochism For Fun And Science

How can you trust your own senses ? How can you be sure that what you're seeing isn't all just some kind of elaborate illusion set up by a powerful entity with a truly warped sense of humour ?

Don't worry, this isn't going to be much of a philosophical rant. Actually, it's going to be an extended rant about how I hurt myself by looking at over 170,000 pictures of static.

Yes, really. All of which relates, of course, to radio astronomy. It's time for the public-friendly explanation of my latest paper, in which I pit human vision against a set of different algorithms and find that there's life in us old monkeys yet. I also inflict things upon myself that I hope never to repeat ever ever again.


1) How To Hunt For Gassy Galaxies

Regular readers will be only too well aware by now of what galaxies look like in neutral hydrogen (HI - "H one") data cubes. There's a longer explanation here, but in general they look like this :


Here, as in the above GIF, we see slices of the data cube as separate images, though we can also view them all at once as a 3D volume. Now I've waxed lyrical on the joys of volumetric data visualisation many times, but not today. It turns out that the image sequence approach isn't as fun, but is usually better for sensitivity. Which is what I wanted to quantify in my paper.

So how do we actually decide what's a galaxy and what isn't ? Clearly, some bright signals are obviously real, but plenty of others are much fainter, and it isn't always obvious if they're from actual galaxies or just the result of noise. 

We've got two options. The classic method is to do it by eye, visually trawling through the data cube slice-by-slice, image-by-image, recording the coordinates of wherever we find anything that just "looks like" a source by some criteria. The other, increasingly popular approach is to rely on algorithms to find the sources for us – maybe using people to check the algorithm's catalogues, but maybe even trusting the algorithms completely.

"Yes, yes," you say, "but how do you ever know if what you've found is real ?"

Indeed. That's a question that plagues both visual and automatic searches. Oh, sure, it's fine if you find a great big blazing beast of a galaxy (like some of those in the above animations), but what about when you've got something more piddly like this ? How could you be sure it wasn't just some slightly brighter-than-usual bit of noise ?

VC1_304, more usually known as NGC 4309. On the left is the source (highlighted by the white outline) as it appears when inspecting the data cube – it's barely visible even with training ! It's only a bit clearer in the spectrum, shown in the middle. Fortunately this object is optically very bright indeed (right), but most objects with this little gas tend to be much dimmer.

Well, there are a few measurements we can make of the gas itself that give us some clues : quantitative values can be a lot more reliable than simply eyeballing it. But first, in the above images you can also see optical data alongside the radio, and that's a very powerful verification check. Gas clouds without associated optical emission aren't non-existent, but they're extremely rare (about 1% of all HI detections by some estimates). You'd be forgiven for missing this one based only on looking at the data cube, but here at least the optical galaxy is unmistakable.
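To make that cross-check concrete, here's a toy sketch of the kind of positional matching involved, using astropy. The catalogues and the matching radius below are invented purely for illustration – they're not taken from any real survey.

```python
import numpy as np
import astropy.units as u
from astropy.coordinates import SkyCoord

# Toy catalogues of sky positions (degrees); real ones would come from the survey data.
hi_ra,  hi_dec  = np.array([185.20, 187.00]), np.array([7.10, 8.30])    # HI detections
opt_ra, opt_dec = np.array([185.201, 190.50]), np.array([7.099, 9.00])  # optical galaxies

hi_coords  = SkyCoord(ra=hi_ra * u.deg,  dec=hi_dec * u.deg)
opt_coords = SkyCoord(ra=opt_ra * u.deg, dec=opt_dec * u.deg)

# Nearest optical neighbour for each HI source
idx, sep2d, _ = hi_coords.match_to_catalog_sky(opt_coords)

# Flag HI detections with a plausible optical counterpart (the radius is a guess)
has_counterpart = sep2d < 1.0 * u.arcmin
print(f"{has_counterpart.sum()} of {len(hi_coords)} HI sources have an optical match")
```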

Understandably then, hardly anyone relies exclusively on the original HI data. We almost always have some optical data to act as independent confirmation (except where the Milky Way blocks our view in the aptly-named Zone of Avoidance) and can usually get follow-up radio observations to directly confirm at least the most interesting signals. That's the absolute gold-standard for verification, and how we know that most optically dark sources aren't real : we've checked them many, many times, and indeed still do when we find anything interesting.

But... what if we can't get either of these ? Exactly how good is the eye at distinguishing signal from noise, and what fraction of the faintest signals does it pick up at all ? That's what I wanted to answer with this paper.


2) Yes, But Why ? And How ?

The key aspect of the problem is that, once you get down to the faint stuff, there are no objective criteria, no magical algorithms, that can 100% reliably distinguish between real signals and noise. Some truly pathetic signals turn out to be real while some fairly convincing ones end up being discarded as worthless junk. The only way to be really sure is to do more observations. 

BUT... algorithms are at least objective and repeatable : throw 'em the same data set and search with the same parameters, and you'll get the same objects every time. Different search methods or parameters can give you different catalogues, but at least if you keep everything the same, you'll get the same results again and again. That's a big advantage over using squishy, emotional humans that might get distracted because they haven't had enough tea or they got sick or their pet hamster died or something.

I thought about refining ChatGPT's weird take on this, but decided the bizarreness of putting the hamster in a box labelled NO TEA was just too funny to alter.

How much does this matter though, really ? Exactly how good are humans compared to algorithms ? Can we even quantify it, or are we all such an emotional, whimsical bunch of wet blankets that we just come up with totally different results every time ? Or if you're a Daily MFail reader, HAS WOKENESS KILLED ASTRONOMY ?

The only way I could see to test this was to look for lots and lots and lots of sources, all with different parameters. Throw enough sheer statistics at the problem and it ought to be possible to see if human abilities could be quantified or not. 

Now to do this requires we have full knowledge of what's there for us to find. Ordinarily this isn't the case at all, because the whole point of the problem is that we don't know for sure which sources we've missed. So the only way to do this is by using fake sources – only then can we be absolutely sure if we've found everything. The basic idea is very simple : to try and find as many artificial signals as we can and measure their parameters.

Of course, for this we also need a data set which doesn't have any actual galaxies in it, otherwise we'll confuse the artificial signals with real ones. Fortunately one of our unpublished data sets includes just such a cube, spanning a frequency range in which real signals just can't happen. To find emission here would require galaxies moving towards us at insane velocities, many thousands of kilometres per second – no real galaxy is known which moves at anything even close to this*. Bingo ! We've got a real data cube with all the imperfections of real observational data, but with absolutely no real galaxies in it. Perfect.

* Redshifts of this magnitude are normal, due to the expansion of the Universe. But there the galaxies are moving away from us. Here we'd need galaxies of extreme blueshift, and there's no known mechanism by which this could happen.

Once we've done the detecting, how to parameterise the results ? Well, with any catalogue it's important to understand its completeness and reliability : that is, what fraction of the sources present it detected, and what fraction of its detections are real. With enough sources we could also see if the eye is especially sensitive to particular properties, like the total brightness and velocity width, and maybe also figure out what sort of false signals fool the eye into thinking there's something present. And I also wanted to test the street wisdom that, apart from speed, humans are generally better than algorithms when it comes to sheer detection capabilities.
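In code terms, completeness and reliability boil down to a couple of ratios. A minimal sketch, with all the numbers invented :

```python
def completeness_and_reliability(n_injected, n_claimed, n_matched):
    """n_injected : artificial sources actually present in the data
       n_claimed  : sources the search claimed to find
       n_matched  : claimed detections that correspond to injected sources"""
    completeness = n_matched / n_injected   # fraction of real sources recovered
    reliability  = n_matched / n_claimed    # fraction of detections that are real
    return completeness, reliability

# e.g. 100 injected sources, 60 claimed detections of which 55 are genuine :
print(completeness_and_reliability(100, 60, 55))   # (0.55, 0.9166...)
```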


3) The Experiment

Figuring out the best approach took a lot of trial and error. The simplest method would be to inject lots of signals into a single large data cube, but this wasn't feasible. This would mean I'd have to mask each galaxy as I went along to avoid cataloguing it twice, which is... not a huge amount of work, but it adds up. And for an experiment of the scale this one became, this would have been unbearable.

The problem is that galaxies themselves have two parameters which control their detectability : their width and their brightness. Here's an example spectrum I use in lectures :

What this is showing is a signal of fixed total flux but a varying velocity width. At the very beginning, all that flux is confined to just a few velocity channels, so it's very narrow but bright. Even though it's so bright, because of the way we typically display the data, the narrowness of the signal makes it hard to spot. As the movie advances the velocity width increases, so that it gets wider and wider but appears dimmer and dimmer. At first this makes it much easier to see : it's still bright but it's no longer narrow, so it's really obvious that there's something atypical here. But eventually that flux is spread out over so many channels that it's barely distinguishable from the background noise at all, even though the total amount of flux is the same throughout the animation.
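If you want to play with this effect yourself, here's a rough numerical version of the same idea : a top-hat "galaxy" of fixed total flux whose peak brightness drops as its width grows. Real HI profiles are of course messier (often double-horned), so treat this as a cartoon.

```python
import numpy as np

rms = 1.0            # noise per channel (arbitrary units)
total_flux = 50.0    # fixed integrated flux, i.e. brightness summed over channels
n_channels = 200

for width in (2, 10, 50, 150):                        # velocity width in channels
    profile = np.zeros(n_channels)
    profile[100 - width // 2 : 100 + width // 2] = total_flux / width   # top-hat
    spectrum = profile + np.random.normal(0.0, rms, n_channels)
    print(f"width {width:3d} channels -> peak S/N ~ {total_flux / width / rms:5.1f}")
```

The total flux never changes, but the peak S/N falls from 25 to well below 1 as the width increases – exactly the trade-off the animation shows.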

I've long thought it an interesting question as to which one matters most. If a source is wide enough, does this compensate for its dimness ? Or is it brightness alone which determines detectability ? My PhD supervisor took it for granted it was the latter, but I was never quite convinced of this.

The only way I could see to tackle the problem was to inject many galaxies each of a given width and brightness. I'd inject, say, 100 with some combination of values, see how many I could find, and then repeat this ad nauseam. I'd need to have plenty of objects for each combination to get a statistically significant result. Since I had very little clue ahead of time where exactly the detectability threshold would be, this would mean injecting a lot of galaxies.

That made the idea of using a single cube a complete non-starter. Eventually I figured out a working strategy, which goes like this :

  1. Pick a width and brightness (signal to noise, S/N) level of the signal.
  2. Extract 100 small "cubelets" at random from the main cube.
  3. For each cubelet, randomly inject (or not inject) a signal of the specified parameters, at a random location within each one.
  4. Modify my source extraction program so I could go through each cubelet sequentially, just clicking on a source if I thought I could see one, or clicking outside the data set if I thought there wasn't one.
  5. Choose new signal parameters and do the whole thing again.
The cubelet is an adorable animal but, like the tribble, it tends to multiply exponentially if you're not careful.
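For the curious, the loop looks something like this in outline. None of this is the actual code from the paper – the injection routine here is a crude Gaussian blob, and the interactive display is only hinted at in a comment.

```python
import numpy as np

def inject_source(cubelet, position, width, peak_snr, rms=1.0):
    """Drop a crude Gaussian blob into the cubelet : roughly `width` channels
    along the spectral axis, peak brightness of peak_snr * rms. Illustrative only."""
    z, y, x = np.indices(cubelet.shape)
    pz, py, px = position
    blob = np.exp(-(((z - pz) / max(width / 2.355, 1e-3)) ** 2     # spectral axis
                    + ((y - py) / 2.0) ** 2 + ((x - px) / 2.0) ** 2) / 2.0)
    return cubelet + peak_snr * rms * blob

def run_trial(empty_cube, width, peak_snr, n_cubelets=100, inject_prob=0.5, size=30):
    """One pass of the experiment : extract small cubelets at random from the
    empty cube, inject a source into roughly half of them, and record the truth
    for later comparison with where the user clicks."""
    rng = np.random.default_rng()
    truth = []
    for i in range(n_cubelets):
        corner = [rng.integers(0, s - size) for s in empty_cube.shape]
        cubelet = empty_cube[tuple(slice(c, c + size) for c in corner)].copy()
        if rng.random() < inject_prob:
            position = tuple(int(p) for p in rng.integers(5, size - 5, 3))
            cubelet = inject_source(cubelet, position, width, peak_snr)
            truth.append((i, position))
        else:
            truth.append((i, None))
        # ...display the cubelet, record any click, compare against `truth`...
    return truth

# e.g. a fake 'empty cube' of pure noise, with sources 10 channels wide at peak S/N 3 :
truth = run_trial(np.random.normal(0, 1, (100, 200, 200)), width=10, peak_snr=3.0)
```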

This was... acceptably fast. Each set of 100 cubelets, containing on average 50 signals, takes about 30 minutes to catalogue. It also made it easy to take breaks, which was absolutely essential. I made it so that every time I clicked to identify a source, I'd be shown if I was correct or not, the result added to a catalogue, and the next cubelet would automatically open. The referee kindly let me get away with language not always typical of an academic paper because here it really does matter :
Without any of these it becomes too easy to be lost in the visual fog - one needs some clue as to what one is looking for or the experience is unendurably frustrating... Being a visual search process, one needs to take much more account of the psychological, emotional experience than in using a pure algorithm.
Finding the initial point was again a matter of trial and error. If I remember correctly, my first guess for the source parameters was too bright and I ended up finding every source (or nearly so), so I kept halving the brightness until the sources became genuinely difficult to spot. From that point I could proceed systematically. The final result is this terrifying table :


Each of those pairs of values represents a search of 100 cubelets. In total I searched 8,500 cubelets, i.e. I looked at 170,000 individual images (slices of data), containing a total of 4,232 sources. But at last, I was done.

This was utterly exhausting. In principle the whole thing could be done in a week; in practice anyone actually trying that would likely claw their own eyes out and hurl them across the room in despair. In terms of calendar time it actually took several months (or... more), and isn't something I ever want to repeat. Fortunately, I'll probably never have to.

170,000 images. I mean, FFS.



4) The Results

The money plot from the paper is this innocuous figure :

Each black circle represents the search of 100 cubelets of a given combination of peak S/N and velocity width. The blue points with error bars show the median at different integrated S/N levels, essentially just a crude fit to the data.

This shows what fraction of sources are found as a function of their integrated S/N (signal to noise). This is a deceptively simple parameter (you can explore it a little more here) that measures how bright the sources are in a more sophisticated way than just the value of the brightest pixel : it accounts for the width of the galaxies as well. The remarkable thing is that the trend is so clear and so tight as a function of integrated S/N even though the experiment was done without any reference to this. Plotting completeness as a function of width or simple peak brightness just gives (more or less) a chart of pure scatter, but this trend emerges like magic. I really wasn't expecting this at all. A parameter designed originally for automated source-finding turns out to describe human vision remarkably well !
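For reference, the simplest version of integrated S/N just treats the noise on a sum of N channels as rms × √N. Survey-specific definitions (smoothing corrections and so on) differ in detail, so take this as a sketch rather than gospel :

```python
import numpy as np

def integrated_snr(channel_fluxes, rms):
    """Sum of the source's flux over its channels, divided by the noise on that
    sum (rms * sqrt(N) for N channels of independent noise)."""
    channel_fluxes = np.asarray(channel_fluxes, dtype=float)
    return channel_fluxes.sum() / (rms * np.sqrt(channel_fluxes.size))

# A source with peak S/N of only 2, but spread over 16 channels :
print(integrated_snr([2.0] * 16, rms=1.0))   # 32 / 4 = 8.0 - comfortably detectable
```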

You can see that there's a critical threshold. Above an integrated S/N of about 6.5 I detected near enough everything, wide or narrow. This value is something we've long known is a good measure of reliability (whether a source is real or not), but now it seems it's also a great way to determine if a sample is complete. If these results are a good reflection of how visual extraction works in the real world, it means we can be confident we've detected every source brighter than 6.5, and about half of all sources at a level of 3.9 or so.

Which by itself is already great news ! Now we can say confidently that above this threshold a) our source is real and b) we've detected all similar sources. In other words we can quantify exactly when our results are unbiased, and also what fraction we're likely to have missed at lower detection criteria.
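In practice that means you can correct a raw source count using the completeness curve. A toy example – the curve here is a made-up stand-in, apart from pinning roughly 50% at 3.9 and roughly 100% at 6.5 as quoted above :

```python
import numpy as np

# Stand-in completeness curve versus integrated S/N (only the 3.9 and 6.5 points
# reflect the values quoted above; the rest are invented for the example).
snr_grid          = np.array([2.0, 3.0, 3.9, 5.0, 6.5, 8.0])
completeness_grid = np.array([0.05, 0.20, 0.50, 0.85, 1.00, 1.00])

def corrected_count(n_detected, snr_int):
    """Estimate how many sources were really there at this integrated S/N."""
    return n_detected / np.interp(snr_int, snr_grid, completeness_grid)

print(corrected_count(10, 3.9))   # ~20 : half of them were probably missed
```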

Lovely. But is this a good approximation to real, in-anger source extraction ? Is the experiment sufficiently realistic ?


5) The Tests

Quite honestly I thought so. The referee, initially, didn't – though quite understandably. So back I went and did a whole bunch more tests just to make sure. Which resulted in this new and improved figure :


This one accounts for several different possible biases (feel free to skip ahead if you already believe the main result) :
  • Green points : these test for the fact that I knew ahead of time that the fraction of cubelets containing a source was always about 50%. Here I instead set this fraction to a random number, but lo, the detection statistics were unchanged. In fact even when I went back and searched cubes where I'd previously missed the source, but now with the full knowledge that a source was present, I still couldn't find them. Foreknowledge just doesn't help much.
  • Red points : randomised source properties. Here I injected cubelets with three different integrated S/N levels designed to give 25, 50 and 75% completeness levels, but with entirely randomised widths, in random order, without giving any indication of which identifications were correct until the experiment was over. Essentially for each cubelet I had no idea if it contained no source at all or one which would be marginally, modestly, or probably detectable, or what it would look like. This again made no difference to the results.
  • Orange points : as above, but now injecting the sources into a single large cube instead of many cubelets. In the main experiment I knew there was at most one source per cubelet; here I didn't know the number injected (randomised with some sensible upper limit) or their properties. This was about as similar to real-world conditions as it's ever going to get, and it still made no difference.
So yes, after looking at another 918 cubes, I can categorically say that the experiment is realistic. There's one final possible bias, but I'll return to that only at the end.


6) Humans Versus Robots : FIGHT !

"It's a movie about a killer robot radio astronomer who travels back in time for some reason."

Okay, so this robustly quantifies how good visual source extraction can be. The obvious next question is : are we really any better off doing it ourselves, or should we let the algorithms loose instead ?

For this I tried two methods. First, I ran the same experiment again but now using automatic source-finding programs instead. Not quite identical though, because now the 80 GB or so of data was spread across two machines, and adapting the programs to search many little cubes instead of one big one (as they're designed for) was just yawningly tedious. Instead, I had a script inject sources of a given integrated S/N (with randomised widths and peak S/N) throughout the main master empty cube, run the extractors, increase the integrated S/N, and iterate until I'd sampled the whole range of values I'd tested visually.
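Schematically, the automated version looks something like the sketch below. The helpers and file names are placeholders rather than the real script, and I'm only assuming that the source finder (SoFiA-2 style) is driven by a pre-written parameter file.

```python
import subprocess

def inject_sources(cube_path, snr_int, n_sources):
    """Placeholder : write n_sources fake galaxies of the requested integrated S/N
    (randomised widths and peak S/N) into a copy of the empty cube and return
    their positions."""
    raise NotImplementedError("stand-in for the actual injection routine")

def match_catalogue(catalogue_path, injected_positions):
    """Placeholder : cross-match the finder's output catalogue against the
    injected positions and return the matches."""
    raise NotImplementedError("stand-in for the actual cross-matching routine")

def automated_run(snr_int_values, n_sources=100):
    """Inject, extract, compare, repeat over a range of integrated S/N levels."""
    completeness = {}
    for snr_int in snr_int_values:
        injected = inject_sources("empty_cube.fits", snr_int, n_sources)
        # SoFiA-2 is normally driven by a parameter file; the file name is a placeholder.
        subprocess.run(["sofia", "sofia_parameters.par"], check=True)
        matched = match_catalogue("sofia_output_cat.txt", injected)
        completeness[snr_int] = len(matched) / n_sources
    return completeness
```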

I'm glossing over a lot of details here, but the end result is another set of unshapely curves :

"Fiducial" is just the main visual experiment. It sounds more sciency than "the one what I did earlier".

Let's start with the thick lines. After much experimentation, these were the settings that gave me the highest completeness rates I could get. With the deservedly popular SoFiA (red), this gives marginally better results than visual, though well within the general scatter. GLADoS, my own obscure code I wrote years ago, does considerably worse – not awful, but it looks a bit pathetic next to SoFiA.

But does this mean that algorithms are actually as good, or even slightly better, than humans ? Actually no, not really. Sort of. Maybe. It's complicated.

The thing is that these curves are all at some given reliability level. It's actually not that easy to show how reliability varies with S/N, but luckily there's an easier way to compare : just count the number of false positives from each method in the same volume. From all the earlier testing, the highest number of spurious signals I found through visual searching of this part of the data was 23. SoFiA, by contrast, found 133, whereas GLADoS found 176.
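Counting false positives is just another cross-match, this time against the injected positions : anything the search claims that isn't near any injected source is spurious. A quick sketch, with the tolerance and positions invented :

```python
import numpy as np

def count_false_positives(claimed, injected, tolerance=5.0):
    """Claimed detections with no injected source within `tolerance`
    pixels/channels count as false positives. Positions are (x, y, channel)."""
    claimed  = np.atleast_2d(np.asarray(claimed,  dtype=float))
    injected = np.atleast_2d(np.asarray(injected, dtype=float))
    if injected.size == 0:
        return len(claimed)
    false = 0
    for det in claimed:
        if np.sqrt(((injected - det) ** 2).sum(axis=1)).min() > tolerance:
            false += 1
    return false

# Two claimed detections, only one of which sits near an injected source :
print(count_false_positives([[10, 10, 5], [50, 50, 20]], [[11, 9, 5]]))   # 1
```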

So that high completeness of SoFiA comes with a huge penalty in reliability. In terms of completeness and reliability, automatic extractors are nowhere near as good as people.

But, they do have a significant compensating advantage : sheer unbridled speed. Even if you have to search a long candidate list from the unreliable algorithms, this can still be faster than visually searching the whole cube. The balance to all this is hard to judge, but there will be a point of diminishing returns – try and dig too deeply into the noise with an algorithm and you'll get so many false sources that the speed advantage will be lost.
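To see roughly where that break-even point sits, a back-of-the-envelope comparison helps. Every number below is invented except the ~30 minutes per 100 cubelets and the 8,500 cubelets mentioned earlier :

```python
# Back-of-the-envelope effort comparison (illustrative numbers only).
minutes_per_cubelet_visual  = 0.3     # ~30 minutes per 100 cubelets, as above
minutes_per_candidate_check = 0.1     # quick yes/no look at an algorithm's candidate
algorithm_runtime_minutes   = 60.0    # time for the automatic run itself (a guess)

n_cubelets   = 8500                   # the full experiment
n_candidates = 5000                   # what an aggressive, deep-threshold run might return

visual_total    = n_cubelets * minutes_per_cubelet_visual
algorithm_total = algorithm_runtime_minutes + n_candidates * minutes_per_candidate_check

print(f"visual : {visual_total:.0f} min, algorithm + checking : {algorithm_total:.0f} min")
# Dig deeper into the noise and n_candidates balloons, eating the speed advantage.
```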

Bottom line ? For small cubes use people. Human extraction isn't that slow, it's better than the automated methods (though there's no harm running these as well), and its results are quantifiable. For larger cubes there isn't really much choice but to use algorithms, but at least this experiment gives a rough guide as to how good they can be.


Conclusions and Caveats

Human source extraction is seriously powerful. Over on Little Physicists I've been exploring the wonders of GPT-5, which has genuinely impressed me for scientific analysis and discussion. But it doesn't hold a candle to billions of years of evolution driving human pattern recognition skills. On that front, for the moment, humans win hands down.

After all, the question "real or fake ?" has caused much debate elsewhere. Methinks I ran the wrong experiment.
 
Interestingly, both visual and automatic techniques tend to pick up on the same sources : it's just that the automatic methods tend to a) miss a lot and b) also find a bunch of crap. Still, that does suggest they're doing something right. Keep dialling up their sensitivity and they'll probably find even more faint sources, it's just that the reliability penalty would be unbearable.

And that's the fun part. How exactly does human vision do so much better at rejecting all the really faint false stuff ? For now we don't know. There are many possibilities, but after looking at this many god damn fecking stupid smudgy feckless bloody false galaxies... I don't intend to investigate this any time soon.

More pragmatically, these results mean we can now say when our results are statistically meaningful, and also give us something to ram down people's throats if they ever dare to suggest that visual extraction isn't any good. But they also show that there's something about human vision, the thing we generally rely on the most for making sense of the world, that we don't fully understand.

And there's one final head-scratcher to end on. That other bias I mentioned ? The referee suggested that maybe in the real world we wouldn't catalogue such faint signals because we wouldn't be able to verify them. "Humph !" I thought, a tad disgusted by such an insinuation. So I ran yet another test, this time injecting fake sources into data cubes also containing real sources. The idea was that if we had missed a lot of real sources in our original search, in this new search we should find the fake sources plus some more real candidates.

We didn't. The fake sources were found at the same rates as before, so the search was just as effective as in the other tests... but it turned up hardly any new candidates for real detections. Great ! But... the odd thing is that the real sources are all much brighter than the faintest things we should be able to find. Where have all the faintest sources gone ? Not a clue ! Maybe, eventually, we'll get some actual proper science out of all this as well as all these dry statistics. Well, you never know.
