


Wednesday, 6 August 2014

Hydrogen, Dinosaurs, And User Support

Now that the bitter cynical ranting is over, it's time for a much more positive note about the research I've been doing for the past year or so. You can, of course, now read the full, magisterial* paper online, if you want the gory** details.

* Read, "fairly interesting".
** Read, "dull".

This is the second part of the trilogy of posts relating to my latest paper. The first was about the peer-review process that is the backbone of modern science, the third (coming soon) will be about what we discovered. In this one, I'll explain a bit about how we detect galaxies, without which we wouldn't be able to do anything at all. Specifically, this is going to be a very pragmatic post about how we look at the data and find what's in it; if you're more interested in how we get the data, you should wait for the third installment. What I'm going to describe here amounts to just three paragraphs in the paper, which should hopefully give you some idea of the work involved in a publication.

For the purposes of this post, the one-line summary is that we point our radio telescope at the sky, tell it to look for hydrogen, and then make maps of the hydrogen over a small part of the sky. We can measure not only how bright the emission is and where it is, but also what frequency (or velocity) it's emitting at. So we don't just get a normal 2D map to look at - we get a 3D data cube to search. And that's not easy.
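To make that concrete, here's a minimal sketch - assuming the astropy library, with a made-up filename - of what such a cube looks like when you load it :

```python
# A minimal sketch, assuming astropy; the filename is illustrative.
from astropy.io import fits

cube = fits.getdata("ages_cube.fits")  # hypothetical AGES-style data cube
print(cube.shape)  # e.g. (1024, 512, 512) : velocity channels x Dec x RA

# Each voxel holds a flux value. A galaxy appears as a clump of bright
# voxels, extended along the velocity axis because the gas is rotating.
```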


You Could Probably Train A Monkey To Do This, But Not A Computer

In fact, finding hydrogen is very boring. It is. It really, really is. That's because although our data is three-dimensional, there are very few good programs* to let us view it in 3D. Consequently our poor students have to spend weeks looking at things like this :

* The redoubtable ds9 is one, but its features are limited. VisIt is an alternative, but it's so feature-packed it comes with a 30-page "quick" start guide (surely an oxymoron) and I can't get it to do anything.



What you're looking at is our data converted into a series of 2D slices. Each bright blob is a hydrogen detection (we'll get to why they look like that in a moment). What's even worse than having to view it in this unnatural, uninteresting way is that there's no way to mark the detections when you've found them. You just have to try and remember which ones you've found. And with a particularly galaxy-rich data set like this one, that's simply impossible.

That didn't stop my student from doing an absolutely freakin' awesome job of cataloguing them, but it did take her about six weeks to find them all.

Now you may be wondering, why not just write a program to detect the hydrogen automatically ? Wouldn't that be faster and better ? In fact, no - in this case computers are, for once, slower and worse. The reason is that humans have had millions of years of evolution which give us pattern recognition skills that are, frankly, frickin' awesome. I mean, c'mon, look at this :



... or this...


... or this...


Now, even though I didn't tell you what to look for, you almost certainly spotted the lion and the tiger essentially instantly. Did you spot the velociraptor before reading the caption ? Don't answer, that was a rhetorical question. The point is, even without any instructions, you can identify threats in what are (compared to hydrogen data cubes) incredibly complex scenes with sufficient speed that you could decide to run away, if necessary.

Puerile though the joke may be, the last scene is actually a very good, nay, excellent analogy for hydrogen detections. The velociraptor is quite hard to distinguish from the foliage, just as hydrogen sources can be very faint. Also, the scene is dominated by a young lady afflicted by back problems, who is far more obvious than the raptor. In real data, artificial radio sources from satellites, radar, wi-fi, mobile phones etc. etc. etc., are often much brighter than the galaxies we're trying to detect.

Which is why one night a group of enterprising REU students decided to remove the road signs advertising mobile phone coverage. I mean, c'mon, they were stuck on the Observatory gate, for heaven's sake. I woke up to find one lying outside my front door.

It's obvious to a human observer that the velociraptor is much more threatening than our meditating friend - not so for a computer. You might be able to write a program to detect danger by looking for, say, nasty big pointy teeth, but asking it to classify any possible animal and make a threat assessment is really rather a tall order. Similarly, it's not too difficult to write a program that finds hydrogen, but it's a lot harder to write one that doesn't find a lot of rubbish as well.

There's one other aspect of the meme that makes it extra perfect. Even if you haven't grown up terrified of velociraptors thanks to watching Jurassic Park when you were ten years old, even if you've somehow never heard of a velociraptor (God help you, you don't know what you're missing), you can still identify it as an animal. It just looks like one. Now, since velociraptors are all dead, we could call it a false positive in our search for threats. Similarly, we sometimes find things that look exactly like hydrogen detections, but aren't real. That's why we take extra "follow-up" observations whenever we're unsure.

OK, so our super-smart monkey brains can spot danger really quickly, and don't need anyone to do any astonishingly complicated pattern-recognition programming. Huzzah ! But now imagine that your task was not simply to detect and run away from the predator, but to make a quick sketch of it first (gosh, this is a good metaphor, isn't it ?). That's the crux of the problem - not detecting the hydrogen, but recording it quickly.


"Blender ? That's Your Answer To Everything !"

For some time, I'd been tinkering in my spare time with ways to import astronomical FITS files into the 3D modelling/animation software Blender. I knew from the first time I sat down and catalogued a data cube that the software we were using was maddeningly inefficient. You couldn't even copy the coordinates of a detection to a file once you found one - you had to type them out yourself. It was a totally ridiculous waste of time* to spend days or weeks looking at blobs on a screen instead of trying to do actual science. And I knew that if I could only import the data into Blender somehow, all my problems would be solved at a stroke - the tools I needed were already an intrinsic, fundamental part of Blender**.

* A.k.a. "character building", or more accurately, "soul destroying."
** That's NOT why I wanted FITS files in Blender though. I just thought it would look cool.

I would not describe myself as a professional programmer and I tend to think of writing code as an option of last resort. Way back at the start of my PhD, I had no idea how to use the Python scripting language that Blender uses. I started learning it as a side-project to import simulation data (which was relatively easy - we'll get to why in a minute) for a friend. That gave me enough Python knowledge to try importing FITS data in various ways. You can read about my earlier efforts here. Initially I was limited to importing only the brightest pixels in the data - crude, but enough to see the data in 3D.

The same data as in the above movie, but now (crudely) rendered in 3D. Right Ascension and Declination are just position on the sky. Almost none of the galaxies here are spatially resolved by Arecibo, so they look like blobs on the sky. But they are very well resolved in the third axis (velocity), since they're rotating, so in 3D they look like long, cigar-like shapes. This is much better in .glass format.
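For what it's worth, that brightest-pixels trick is simple enough to sketch. The following is not the actual code, just a rough illustration (assuming astropy, numpy and Blender's Python API, with an illustrative filename) of the idea : one mesh vertex per bright voxel, drawn as dots in the viewport.

```python
import bpy
import numpy as np
from astropy.io import fits

cube = fits.getdata("cube.fits")          # illustrative filename
threshold = np.nanpercentile(cube, 99.5)  # keep only the brightest ~0.5% of voxels
zz, yy, xx = np.where(cube > threshold)

# One mesh vertex per bright voxel; vertices display as dots in the 3D view.
verts = [(float(x), float(y), float(z)) for x, y, z in zip(xx, yy, zz)]
mesh = bpy.data.meshes.new("bright_voxels")
mesh.from_pydata(verts, [], [])
obj = bpy.data.objects.new("bright_voxels", mesh)
bpy.context.collection.objects.link(obj)
```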

The problem is that it's relatively easy to import and display a few thousand or tens of thousands of discrete data points in Blender, but it's very much harder to import the tens or hundreds of millions of data points that form a continuous volume - like a FITS file. Anything less than the full data set might be OK for making pretty pictures, but wouldn't be "science quality" - for example, if the display was limited to showing only the brightest hydrogen, we'd miss the faint stuff, which can be the most interesting. But if you display the weak emission, you generally have to display the noise in the data as well, and that means you need to display huge numbers of data points.

This is what one million dots look like. To do anything useful with astronomical data, we need some way to visualise at least one hundred million data points.

It took another year or so in Arecibo, playing around in my spare time, before I hit upon a more useful solution than displaying a bunch of dots*. Trying to create a million little dots in Blender would be difficult, but it can easily handle an image of one million little dots. So by slicing the data into a series of images and then mapping each image onto a virtual Blender object (a simple flat plane), it would be possible to display the data without having to remove all the faint, potentially very interesting emission.

*Another problem with this method is that in Blender the dots can't have any colour information in the realtime view - so very bright sources look the same as very faint ones.

Now, at this point I could cut a long story short, but I'm not going to. Why ? Because it was bloody difficult, that's why ! With still very limited knowledge of Python, I needed a proof-of-concept that this method could work. I used a program called kvis (which we normally use for viewing the data in 2D) to create images of slices of the data. It only outputs in the obscure .ppm format, so I then had to convert the images into something Blender can read, like .png. Then, as an initial test, I manually loaded a bunch of these images into Blender, using each one to control the transparency of the plane it was mapped on to. This is massively impractical - if your cube has 100 pixels on a side, it needs 100 slices to show everything, and loading each one by hand is boring. But it worked.
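The conversion step, at least, was trivial. Something like this sketch (assuming the Pillow library; paths are illustrative) handles the .ppm to .png translation :

```python
import glob
from PIL import Image  # Pillow reads .ppm natively

# Convert every kvis-style .ppm slice into a .png that Blender can load.
for fname in glob.glob("slices/*.ppm"):
    Image.open(fname).save(fname.replace(".ppm", ".png"))
```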

Pretty convincing for a bunch of planes.

Now I was really getting somewhere - I could view volumetric data in realtime without needing to cut any of the faint emission. The concept was proven, so I began working on this in earnest in between supervising a summer student (who was usually busy searching the cube the hard way with kvis) and finally publishing the data from my thesis. First, I had to find a workaround for a major problem with Blender : viewing the data from behind. Somehow, what looks great from one angle just looks mwwurrrgh (technical term) from the reverse angle.

Bugger.

After a couple of weeks wrestling with this one, I eventually consulted the Blender forums and found that the workaround was to create a copy of the image planes and put them somewhere else. Making a copy somehow "resorts" the textures so they look fine from the opposite direction. So then I wrote a little script to automatically change which images should be displayed depending on the orientation of the viewpoint. Swivel around too far and it automatically switches the view to the image copies instead of the originals.
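FRELLED itself was written for a much older version of Blender, but in the modern bpy API the trick might look something like this sketch (the plane names are made up, and I'm assuming an earlier setup step created both sets of planes) :

```python
import bpy
from mathutils import Vector

def update_plane_visibility():
    # Find the 3D viewport and get its current view direction.
    for area in bpy.context.screen.areas:
        if area.type == 'VIEW_3D':
            rv3d = area.spaces.active.region_3d
            view_dir = rv3d.view_rotation @ Vector((0.0, 0.0, -1.0))
            break
    else:
        return  # no 3D viewport found

    # Looking "down" the cube's z-axis ? Show the original planes.
    # Otherwise show the duplicated, re-sorted copies instead.
    facing_front = view_dir.z < 0.0
    for obj in bpy.data.objects:
        if obj.name.startswith("slice_front_"):
            obj.hide_set(not facing_front)
        elif obj.name.startswith("slice_back_"):
            obj.hide_set(facing_front)
```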

That made things almost useful. A great deal of struggling with Blender's Python reference eventually allowed me to load the images automatically, which was infinitely more practical than loading them one at a time. The reference guide is fantastic if you basically know what you're doing, but unfortunately it's not a great source of tutorials; after a lot of trial and error, though, I was able to make it work.
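In today's bpy API, the automatic loading might look roughly like this (again, a sketch rather than FRELLED's actual code; the directory is illustrative). Each slice gets its own plane, with brightness wired to opacity so that blank sky is invisible :

```python
import bpy
import glob

for i, fname in enumerate(sorted(glob.glob("/tmp/slices/*.png"))):
    # One plane per slice, stacked along the z (velocity) axis.
    bpy.ops.mesh.primitive_plane_add(location=(0.0, 0.0, float(i)))
    plane = bpy.context.active_object

    mat = bpy.data.materials.new(name="slice_%04d" % i)
    mat.use_nodes = True
    mat.blend_method = 'BLEND'  # enable transparency in the viewport

    tex = mat.node_tree.nodes.new('ShaderNodeTexImage')
    tex.image = bpy.data.images.load(fname)
    bsdf = mat.node_tree.nodes["Principled BSDF"]
    # Brightness controls opacity : faint emission is translucent,
    # empty sky disappears entirely.
    mat.node_tree.links.new(tex.outputs['Color'], bsdf.inputs['Base Color'])
    mat.node_tree.links.new(tex.outputs['Color'], bsdf.inputs['Alpha'])

    plane.data.materials.append(mat)
```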

"Finally", I also had to teach myself matplotlib to avoid having to use two other programs to convert the data. That was relatively simple since the documentation was much better. This still only gave me the very very basics of what I wanted, but it was working : I could load any FITS cube I damn well wanted and look at it as nature intended, in 3D, with a few mouse clicks.


That wasn't the end though. Looking at it in 3D is nice, but not very useful by itself - the axes of the data need labels ! For that, I had to learn how to convert pixel positions into sky coordinates, something I'd been perfectly happy to let other programs do by magic. I never wanted to go into the details of world coordinate systems, or write a routine to decide where it would be best to place the tick marks. But without this, no-one would ever take it seriously, me included. It would be a cute little gimmick, nothing more. Eventually I figured that one out too.
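Nowadays astropy does the heavy lifting for you. A minimal sketch of the pixel-to-sky conversion (the filename and pixel values are illustrative, and the exact axis order and units depend on the FITS header) :

```python
from astropy.io import fits
from astropy.wcs import WCS

wcs = WCS(fits.getheader("cube.fits"))

# Convert a pixel position (x, y, channel) to world coordinates :
# typically RA and Dec in degrees, plus velocity in the header's spectral units.
ra, dec, vel = wcs.wcs_pix2world(250.0, 310.0, 42.0, 0)
print(ra, dec, vel)
```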


That meant I could also click on a source and know exactly where it was on the sky. So now, instead of manually writing the coordinates down, you just click on a source and Blender calculates the position for you. Even better, you can add an object to hide the source, so that you never forget which sources you've already found. You can turn the "masks" on and off if they're causing problems. Although you can do all this with other programs, they're not interactive - you have to enter the box coordinates manually and then re-load the data, which is much, much slower.
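Stripped of the interface, the click-to-catalogue idea is roughly this (a sketch only : it assumes astropy is available inside Blender's Python, and that the planes are laid out so Blender coordinates match pixel coordinates) :

```python
import bpy
from astropy.io import fits
from astropy.wcs import WCS

wcs = WCS(fits.getheader("cube.fits"))

# The user marks a source with the 3D cursor; read that position as pixels.
loc = bpy.context.scene.cursor.location
ra, dec, vel = wcs.wcs_pix2world(loc.x, loc.y, loc.z, 0)
print("Detection at RA = %.4f, Dec = %.4f, v = %.1f" % (ra, dec, vel))

# Drop a small cube over the source as a toggleable "mask", so there's
# no need to remember which detections are already catalogued.
bpy.ops.mesh.primitive_cube_add(location=loc, size=3.0)
bpy.context.active_object.name = "mask_%.4f_%.4f" % (ra, dec)
```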



The upshot of this is that instead of having to spend weeks tediously poring over a data cube finding galaxies, you can get that stage done in about a day. It's about 50 times faster than using kvis. Conclusion ? Rhys wins. I didn't put that in the paper though.


Epilogue

Fast-forward about a year and the prototype had been developed into an all-singing, all-dancing FITS viewer, now called FRELLED. It could load files very reliably, through a user non-hostile interface. Another year or so and it could load simulations, cross-reference NED and the SDSS, plot contours and automatically create animations to impress people. Tasks that previously took minutes now take seconds; tasks that took weeks now take hours. If I had a TARDIS, I'd go back in time to PhD-me and say, "Rhys ! Forget the data analysis ! Learn Python, you dolt !".

Worryingly, that probably is the first thing I'd do if given access to a time machine.

Poster for the 2013 AAS in Long Beach, California.

Currently, after about two years of on-off work, FRELLED is more or less "core complete". It does everything I need it to, plus a few things I need it to do now that I didn't need it to do before. It's also evolved to the point where (I hope) the user really doesn't need to know anything at all about Blender. It's gotten pretty complex, so it still throws a wobbly from time to time when someone does something unexpected, but generally, through much toil, it's pretty stable.


You may be wondering, well, was it worth it ? YES OF COURSE IT WAS YOU STUPID PETTY FOOL ! HOW DARE YOU QUESTION - err, by which I mean, yes definitely, because it's not just useful for the incredibly niche aspect of looking at hydrogen data cubes. That's certainly what it's best at, of course.



But it can do a lot more than that. In principle it can load any volumetric data set. Here, for example, is an MRI scan of a banana flower that someone converted into a slice-by-slice 2D GIF which, with a small amount of tweaking, I was able to reconstruct back into 3D in FRELLED :



Of course, simulations are even more fun because we can watch things evolve with time. Here's one that went wrong because the galaxy melted :


So, FRELLED has made my day-to-day life a lot easier, giving me more time to watch YouTube (don't worry, that's just astronomer-speak for "do ground-breaking science") and consume copious amounts of tea. In the future, it may become even more useful. Current surveys have at most a few thousand hydrogen detections... pretty soon, thanks to new telescopes and instrumentation, we'll be in the era of hundreds of thousands of detections. Rather than relying on slow, unreliable programs to detect the hydrogen, with FRELLED (or something like it) and maybe a bit of crowd-sourcing, we could potentially let humans do the pattern recognition they're so good at without eating up years of astronomers' valuable time. I call that a success.
