10. Data access options

There are many different types of data available from Trove and many different ways of accessing it. You can manually download some data, such as images, from Trove’s web interface. If you’re creating small, selective datasets, these manual methods might be all you need.

But what if you want to save all the results from a search, automate the downloading of images and text, or create a pipeline to feed Trove data into a specific tool for analysis? In these sorts of cases, you need access methods that are reusable and extensible – methods that can be invoked using code and that deliver data in a machine-readable format that computers can manipulate.

The Trove Application Programming Interface (API) is the main way of accessing machine-readable data using automated methods. Computer programs can request data from the API and have it delivered in a predictable, structured format. Using the API you can construct reusable data-processing workflows, and create datasets containing millions of items.
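As a rough illustration of what "requesting data from the API" looks like in code, here is a minimal Python sketch that builds a search request without sending it. The endpoint URL, the `q`, `category` and `encoding` parameters, and the `X-API-KEY` header are assumptions based on version 3 of the Trove API, so check them against the current API documentation. `build_search_request` is a hypothetical helper, and `YOUR_API_KEY` is a placeholder for your own key.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed search endpoint for version 3 of the Trove API; confirm in the API docs.
API_URL = "https://api.trove.nla.gov.au/v3/result"


def build_search_request(query, api_key, category="newspaper"):
    """Build (but don't send) a search request asking for JSON results.

    Hypothetical helper for illustration; parameter names are assumptions.
    """
    params = {"q": query, "category": category, "encoding": "json"}
    url = f"{API_URL}?{urlencode(params)}"
    # The API key is assumed to travel in a request header rather than the URL.
    return Request(url, headers={"X-API-KEY": api_key})


request = build_search_request("wragge", "YOUR_API_KEY")
print(request.full_url)
```

Passing the request to `urllib.request.urlopen()` with a valid key should return a structured response that can be parsed with `json.load()`. Because the same function works for any query, it can be reused inside a loop or a larger workflow.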

However, the Trove API does have a few gaps and inconsistencies. Sometimes there’s just no convenient way of getting the data you want. In these cases you might need to resort to screen scraping – a process of extracting structured data from regular web pages. Compared to API access, screen scraping tends to be inefficient and error-prone. But it’s a handy technique when other methods fail. See, for example: HOW TO: Get information about the position of OCRd newspaper text.
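To give a sense of what screen scraping involves, the sketch below pulls titles and links out of a small HTML fragment using Python's built-in `html.parser` module. The HTML is invented for illustration and does not reflect Trove's actual page markup; the `title` class and the `TitleLinkParser` name are likewise hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment, invented for this example; not Trove's real markup.
SAMPLE_HTML = """
<ul>
  <li><a class="title" href="/item/1">First article</a></li>
  <li><a class="title" href="/item/2">Second article</a></li>
</ul>
"""


class TitleLinkParser(HTMLParser):
    """Collect the text and href of every <a class="title"> element."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current_href = None  # set while we are inside a matching link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "title":
            self._current_href = attrs.get("href")

    def handle_data(self, data):
        # Only keep text that appears inside a matching link.
        if self._current_href is not None and data.strip():
            self.records.append({"title": data.strip(), "url": self._current_href})

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None


parser = TitleLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.records)
```

Note how the parser depends on details of the page's markup, such as the class name on the link. If the page design changes, the scraper silently stops finding data, which is one reason scraping is more fragile than API access.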

It’s also possible that someone might have done all the work for you! There are a number of ready-made datasets available for you to download and explore.