Tuesday, February 10, 2015

Some Hard-Earned Principles for Managing Data and Experiment Code

This is where I've ended up after 10 years in vision science and many painful experiences: discovering that data got messed up at some unknown time in the past, or having to redo grueling Excel-based analyses. If you can start this way from the very beginning, you'll have a nicer time than I did.

A lot of this has to do with what I would call "personal forensics": can you reconstruct where the data came from? Can you find the code for a particular experiment, even 5 years later? Can you rerun the analysis and get the same result? But it should also help to avoid getting snarled up even in day-to-day work, e.g. moving code and data between development and experiment machines.

These are framed in single-user terms, but of course this stuff would ideally be thought of in terms of lab policy (and usually you should follow the policy of your lab rather than these if there's a conflict). It's also limited by my experience, which is primarily psychophysical and eyetracking experiments - there are probably many special principles that apply to massive data files that go through many processing steps before they turn into delicious chart and statistical output.

Also wanted to say that a number of these were inspired by my friend Tom Wallis's blog, such as this excellent post about how to organize your folders. Here we go:
  • On the top level of folders, organize by project (rather than type of file), keeping everything for a particular project together.
  • As much as possible, use text files for all the raw data output of your experiment - probably in comma-separated format (those open automatically in Excel if they have the extension .csv, and are easily imported by every analysis package). The first row should be descriptive column headings (no spaces). I use separate files for separate sessions, and have each row represent a trial (see the first sketch after this list).
  • Put lots of identifying info into the data file name (underscores rather than spaces) so that it makes sense on its own - at least the experiment title, the participant ID, and the complete date. Some people put the time in seconds too.
  • After each participant, add notes about the run, including anything that went wrong and any observations they made, to the end of a text file in the same folder as the results, labelling the entry with the same naming convention as the data file.
  • Don't ever touch the raw data files, except to fix errors in them (and if so, make sure the original has been committed first).
  • A column with a timestamp for each trial, down to the second (such as the 7-digit number returned by the Psychophysics Toolbox's GetSecs), can be helpful both for unambiguously identifying trials and for evaluating the time course of an experiment.
  • Generate intermediate numerical files when necessary, also in text format, and store them in a different folder than the raw data.
  • Do all statistics and summarizing in scripts, in something like Python, Matlab or R - something very general that has a lot of support and longevity. Matlab is probably the worst bet at the moment, since it is the only closed-source (and expensive) option of the three, and many scientists and mathematicians are deciding they don't want to put their time into adding value to a closed program that is not available to everyone (and thereby increasing lock-in).
  • Also do all the figure generation in one of those languages, with a single script that can go all the way from the raw data to the final, publication-quality figures with no manual tweaking on your part, and generate the figures at their final printed size (see the plotting sketch after this list).
  • Output the figures in PDF format (or some other vector art format, but PDF appears to have the most support).
  • Keep the analysis scripts in a separate folder from the experiment code.
  • Put all the files (except the pdfs or other bulky binary files), including experiment code, analysis code, and data, under some kind of version control, such as Git or Subversion. If you have any folders with any variation of "old version" in the title, you need version control.
  • Commit whenever there's a significant change or important fix. Commit raw data immediately upon collecting it, so that the uncontaminated data is always safe. It may not be up to the forensic standard of a court of law, but for your own auditing of your data it is plenty of protection.
  • As the software engineering wisdom goes, only commit code that you have tested at least a little and feel confident runs (because it's harder to retroactively investigate code that doesn't run), and write a sensible commit message to go with each version.
  • When there need to be multiple copies of the code on multiple computers, make sure they're linked up properly with version control (in Git this means cloning the repository), and then resynced when necessary.
  • If you can, have all your stuff, from every phase of your career, on one hard drive. Apart from big video files or giant data files, with hard drives getting bigger all the time (and search getting faster and smarter) there's no reason not to keep it together, and it gives you a target for backup.
  • The bare minimum for backing up data is automated, daily backup of everything to a device you control. Anything less than this is incredibly risky! Ideally it will also be incremental (using something like Time Machine on the Mac, Windows Backup on Windows, or rsync on Linux), meaning that you can "rewind" to versions from earlier dates. That protects against cases of data getting invisibly corrupted and then backed up.
  • It probably helps a bit to have lots of incidental data redundancy, such as Dropbox and additional copies of the repository on different computers/USB keys (and any kind of possibly-unreliable university- or company-wide automated backup). However, don't let those redundancies relax your vigilance with regard to the first type of backup. And don't spend any time on the "copy two or three folders manually once in a while" backup strategy, as it is almost worthless.
  • On the other hand, having additional off-site backup is a very good idea - earthquakes might not be an issue where you live, but I've known several labs that have been burgled. One method is to rotate complete clones of the hard drive monthly, say, and take the unused drive to some other location. If privacy isn't a concern, then an internet-based backup service like Carbonite may be a good solution for this too.
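
To make some of those file conventions concrete, here's a minimal Python sketch (any of the three languages would do) of saving one session: a descriptive file name, a CSV with a header row and one row per trial including a per-trial timestamp, and a labelled note appended to a run-notes file. The experiment name, column names, and folder layout are placeholders I made up for illustration, not a prescription.

    # Minimal sketch of the data-file conventions above (names are placeholders).
    import csv
    import os
    import time

    def save_session(project_dir, experiment, participant_id, trials, notes=""):
        """Write one CSV per session (one row per trial) plus a labelled run note."""
        raw_dir = os.path.join(project_dir, "raw_data")
        os.makedirs(raw_dir, exist_ok=True)

        # Identifying info in the file name (underscores, no spaces):
        # experiment title, participant ID, complete date and time.
        stamp = time.strftime("%Y-%m-%d_%H%M%S")
        base = "{}_{}_{}".format(experiment, participant_id, stamp)
        data_path = os.path.join(raw_dir, base + ".csv")

        # First row: descriptive column headings (no spaces); then one row per trial.
        fieldnames = ["trial", "timestamp_s", "condition", "response", "correct"]
        with open(data_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(trials)  # trials: a list of dicts with those keys

        # Append run notes to one text file in the same folder,
        # labelled with the same convention as the data file.
        with open(os.path.join(raw_dir, "run_notes.txt"), "a") as f:
            f.write(base + "\n" + notes + "\n\n")

The timestamp_s column is where a GetSecs-style value would go on every trial.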
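
And here's an equally rough sketch of the single-script-from-raw-data-to-finished-figure idea, using pandas and matplotlib. Again, the column names, folder names, and figure size are assumptions for the sake of the example; the point is just that the script reads the raw CSVs, does its own summarizing, and writes a correctly sized vector PDF with no manual tweaking.

    # Sketch only: raw CSVs in, publication-sized PDF out, no hand editing.
    import glob
    import os
    import pandas as pd
    import matplotlib.pyplot as plt

    # Read every session file and stack them into one table.
    frames = [pd.read_csv(path) for path in glob.glob("raw_data/*.csv")]
    data = pd.concat(frames, ignore_index=True)

    # Summarize: mean proportion correct per condition (a placeholder analysis).
    summary = data.groupby("condition")["correct"].mean()

    # Generate the figure at its final printed size and save it as a vector PDF.
    fig, ax = plt.subplots(figsize=(3.5, 2.5))  # inches; e.g. single-column width
    summary.plot.bar(ax=ax)
    ax.set_xlabel("Condition")
    ax.set_ylabel("Proportion correct")
    fig.tight_layout()
    os.makedirs("figures", exist_ok=True)
    fig.savefig("figures/accuracy_by_condition.pdf")
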
This was originally an email to a friend who's starting his Ph.D. in Psychology, and Jim Davies said I should post it, so sorry it's just a huge unstructured list on scattered topics.

There are two big potential speedbumps that I see to adopting these. One is learning the programming tools that can automate manipulating text files - for example, how to paste together output files from multiple participants to make one aggregate file, stripping off the header rows and adding a column for the participant. This goes with Tom Wallis's principle of not touching your data "by hand", which I agree with, and it's necessary to achieve the ideal of automating all the steps from raw data to submission-quality figure and statistical output. The other is learning to use version control software such as Git, which definitely requires conceptual investment.
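
For what it's worth, that aggregation step is only a few lines once you know a tool like pandas. Here's a rough sketch, assuming the per-participant files follow a naming convention like the one sketched earlier (the file-name pattern and columns are just illustrative): read each file, take the participant ID from the file name, add it as a column, and write one combined file, leaving the raw files untouched.

    # Sketch: combine per-participant CSVs into one aggregate file.
    import glob
    import os
    import pandas as pd

    frames = []
    for path in sorted(glob.glob("raw_data/myexperiment_*.csv")):
        df = pd.read_csv(path)                # the header row is handled automatically
        participant_id = os.path.basename(path).split("_")[1]
        df["participant"] = participant_id    # participant ID taken from the file name
        frames.append(df)

    # One aggregate file, stored separately from the untouched raw data.
    os.makedirs("intermediate", exist_ok=True)
    pd.concat(frames, ignore_index=True).to_csv(
        "intermediate/all_participants.csv", index=False)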

For some people, learning this way of working might just be too much of a pain, and not worth it relative to just getting on with the science, in whatever tools they are used to. But if you think you can get past those speedbumps, I firmly believe it will pay off bigtime, and keep paying off throughout your career. It should let you spend more of your time doing the important, fun parts of science - and less time on resizing a figure by hand for the 27th time, or late, panicky nights trying to figure out where those spreadsheet numbers came from.
