As a general rule of thumb, while actively working with a dataset you should use whichever file format best suits the way you work. In most cases, this will be dictated by the software that you prefer to use. If you have some flexibility, perhaps because your software supports several formats or you are writing your own software, consider using one of the archive-suitable formats described below.
When you have finished working with a particular dataset, you should convert it to a more stable, standard format for archiving. It is increasingly common to find old files that are now completely unreadable, simply because the software that created them is no longer available.
Ideally, your archival format should be at least one of:
- readable using free tools (ideally plain text): so it can be accessed without a potentially-expensive license
- a well-documented standard: so a wide variety of software is available to access it
- a de facto standard in your research area: so the majority of researchers you share it with can be expected to have access to the right software
If possible, try to choose a format that allows you to describe and document the data directly within the file.
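
As a rough illustration of documenting data inside the file itself, the sketch below writes a comma-separated file whose first lines are `#`-prefixed metadata comments. It uses only the Python standard library. The function names, the metadata keys and the `#` comment convention are all assumptions made for this example: comment lines are a common habit rather than part of any CSV standard, so any reader of the file needs to know to skip them.

```python
import csv
import io


def write_documented_csv(stream, metadata, header, rows, comment="#"):
    """Write metadata as comment lines, then a header row and the data rows.

    The comment-line convention is an assumption of this sketch, not a
    CSV standard.
    """
    for key, value in metadata.items():
        stream.write(f"{comment} {key}: {value}\n")
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


def read_documented_csv(stream, comment="#"):
    """Return (metadata, header, rows) from a file written as above.

    Limitation: any line that starts with the comment marker is treated
    as metadata, so data fields containing line breaks are not supported.
    """
    metadata = {}
    data_lines = []
    for line in stream:
        if line.startswith(comment):
            # Split "key: value" at the first colon only.
            key, _, value = line[len(comment):].partition(":")
            metadata[key.strip()] = value.strip()
        else:
            data_lines.append(line)
    reader = csv.reader(data_lines)
    header = next(reader)
    return metadata, header, list(reader)


# Round trip through an in-memory buffer; a real file opened with
# open(path, "w", newline="") would work the same way.
buffer = io.StringIO()
write_documented_csv(
    buffer,
    {"title": "Example river temperatures", "units": "degrees Celsius"},
    ["site", "temperature"],
    [["A", 11.5], ["B", 12.0]],
)
buffer.seek(0)
metadata, header, rows = read_documented_csv(buffer)
```

Note that values come back as strings when the file is read, so numeric columns need converting by whoever loads them. Formats such as NetCDF and HDF5 store metadata and data types natively and avoid this kind of ad hoc convention.
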
Category | Formats | Comments |
---|---|---|
Text | Plain text, HTML, Rich Text Format, Markdown/RST/Textile/etc. | |
Text | PDF/A | Only use for scans or if page layout is critical |
Tabular/numeric | Comma-/Tab-Separated Values, XML | Human-readable with just a text editor |
Tabular/numeric | NetCDF, HDF5, FITS | Particularly good for complex or hierarchical data structures, and embedding metadata |
Images | TIFF, PNG, JPEG2000 | Avoid GIF and standard JPEG |
Movies | MP4, Ogg Video | Prefer open codecs wherever possible |
Sound | FLAC, Ogg Audio | Prefer open codecs wherever possible |

See more examples from the UK Data Service.