
CTA’s Archive from Open Access to Open Data

Israel Bionyi Nyoh

Background

Since late 2013 the GODAN partners CABI and CTA (Technical Centre for Agricultural and Rural Cooperation) have been working to migrate the CTA Virtual Resource Center, Anancy, to CGSpace. CGSpace is a DSpace repository used by CGIAR (Consultative Group for International Agricultural Research) and its partners. It is currently home to around 50,000 items, 40,000 of them open access. Anancy’s collection is now one of the many collections inside CGSpace. All of the Anancy data is stored on CGSpace using Dublin Core standards and partner-specific metadata fields. The platform provides open access to metadata for research output and, following recent improvements, exposes that metadata in machine-readable form as XML and JSON.

The DSpace software is open source and is used by, amongst others, academic and non-profit organisations. It can provide open access to a range of digital content including text, images, PDF, video and audio.

Accessing the materials in CGSpace that come from CTA’s Anancy resource is easy and does not require a login. You can search and browse all of the CTA Anancy content via a web browser, or programmatically using an API.

A self-registered login enables additional functionality, such as the ability to request email alerts for updated content.

DSpace & CGSpace basic hierarchical structure

Understanding the data structure is important for those who want to extract content for their own collections and use it in their own sites.

  • Items: an item includes metadata, such as author and title, and access to binary files, such as PDF or audio files.
  • Collections: items are stored in collections.
  • Communities: collections are stored in communities. Communities can also be grouped into other communities in a hierarchical structure.
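
The hierarchy above can be pictured as nested records. The following is a minimal Python sketch for illustration only: the class and field names are assumptions chosen for readability and are not the DSpace data model or its API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    # Descriptive metadata plus the names of any attached files (PDF, audio, ...).
    title: str
    author: str
    files: List[str] = field(default_factory=list)


@dataclass
class Collection:
    name: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Community:
    name: str
    collections: List[Collection] = field(default_factory=list)
    # Communities can be nested inside other communities.
    sub_communities: List["Community"] = field(default_factory=list)


def count_items(community: Community) -> int:
    """Count every item in a community, including its sub-communities."""
    own = sum(len(c.items) for c in community.collections)
    return own + sum(count_items(s) for s in community.sub_communities)
```

Walking the structure recursively, as count_items does, mirrors the way a client drills down from a community to its items.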

Direct access to content via the REST API

You may now use a REST API for open access to all of the materials and the metadata as JSON or XML. This can be used to integrate the Anancy content, or other collections, with other datasets. Accessing the data via the CGSpace API enables developers to interrogate the raw data and to drill down through the hierarchy of sub-communities and collections to all of the items originally stored on the Anancy website. The REST API on CGSpace is open to all; there is no need to have an account, paid or otherwise.

This guide highlights the main endpoints that you can use to browse the Anancy collection. The full API documentation is available in the DSpace documentation.

Tip: Due to the way Chrome displays the data by default, it is a good choice for manually browsing the API

Expanding the data returned

The /communities, /collections, /items and /bitstreams endpoints, detailed below, all accept the optional expand parameter. Using this will expand certain sections of the data returned. You can tell which expand options are available by looking for the expand fields in the JSON or XML returned from the endpoint.

e.g.
<community>
<expand>parentCommunity</expand>
<expand>collections</expand>
<expand>all</expand>

</community>

The snippet above shows that you could choose to expand “parentCommunity” or “collections”. If you use the “all” option, the returned data will expand all available sections. In this example that means both the parent community and the collections, but other sections, such as items, logos or licenses, may also be included and available for expansion. You may also expand multiple sections at a time by separating them with commas.

Example usage:
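
As a minimal sketch, the hypothetical Python helper below (rest_url is not part of DSpace) builds request URLs with one or more expand sections. It assumes the base URL and the CTA community identifier 157 given later in this guide.

```python
from urllib.parse import urlencode

BASE_URL = "https://cgspace.cgiar.org/rest"  # base URL given in this guide


def rest_url(path, expand=None):
    """Build a REST URL, optionally with a comma-separated expand list."""
    url = BASE_URL + "/" + path.strip("/")
    if expand:
        # safe="," keeps the commas readable instead of encoding them as %2C.
        url += "?" + urlencode({"expand": ",".join(expand)}, safe=",")
    return url


# Expand two sections at once, then expand everything.
print(rest_url("communities/157", expand=["parentCommunity", "collections"]))
print(rest_url("communities/157", expand=["all"]))
```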

Since the DSpace 5 upgrade, there is an alternative to the expand parameter for the more commonly used sections (communities, collections, items, metadata and bitstreams): the dedicated endpoints marked “New in DSpace 5” below.

Endpoints for accessing CGSpace & Anancy data

https://cgspace.cgiar.org/rest/

/communities

Returns all of the communities in CGSpace

/communities/{communityIdentifier}

Returns a specific community. You can view all the identifiers using the /communities endpoint and examining the ID element. For example, the CTA identifier is 157.

Communities may have sub-communities and collections below them. You can see these by using the “expand=subCommunities” or “expand=collections” parameter, e.g. https://CGSpace.cgiar.org/rest/communities/157?expand=subCommunities. Note that the parameter value is case-sensitive.

The ID field within the sub-community element can be used as the identifier to look up the record for that community. You may then view details about the sub-community using the /communities endpoint, and optionally include the “expand=parentCommunity” parameter to see details about the parent. e.g. https://cgspace.cgiar.org/rest/communities/158?expand=parentCommunity
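
The lookup described above can be sketched in Python as follows. The payload is illustrative only: the key names ("id", "name", "subcommunities"), the names and the identifier 159 are assumptions and were not copied from a CGSpace response, so check them against the JSON that the endpoint actually returns.

```python
import json

# Illustrative payload only; key names and values are assumptions.
sample = json.loads("""
{
  "id": 157,
  "name": "Example parent community",
  "subcommunities": [
    {"id": 158, "name": "Example sub-community A"},
    {"id": 159, "name": "Example sub-community B"}
  ]
}
""")


def sub_community_paths(community):
    """Return the /communities/{id} path for each expanded sub-community."""
    subs = community.get("subcommunities", [])
    return ["/communities/%d" % sub["id"] for sub in subs]


print(sub_community_paths(sample))
```

Each returned path can then be requested in turn, optionally with expand=parentCommunity, to move up or down the hierarchy.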

/communities/{communityIdentifier}/communities

(New in DSpace 5) Returns all of the sub-communities below the community identifier given. An alternative to /communities/{communityIdentifier}?expand=subCommunities. Please note that there is a slight inconsistency in naming between the two methods: the endpoint uses “communities”, whereas the expand parameter uses “subCommunities”.

/communities/{communityIdentifier}/collections

(New in DSpace 5) Returns all of the collections held in the community identifier given. An alternative to /communities/{communityIdentifier}?expand=collections.

/communities/top-communities

(New in DSpace 5) Returns all top level communities

/collections

Returns all of the collections within CGSpace

/collections/{collectionIdentifier}

Returns the specified collection. You can get collection identifiers from the ID field returned by the /communities endpoints when one of the options to expand collections is used.

/collections/{collectionIdentifier}/items

(New in DSpace 5) Returns all the items for the specified collection. You can get collection identifiers from the /communities endpoints from the ID field when one of the options to expand collections is used. An alternative to /collections/{collectionIdentifier}?expand=items

/items/{itemIdentifier}

Returns the item for the specified identifier. You can get item identifiers from the /collections endpoint from the ID field when one of the options to expand items is used

/items/{itemIdentifier}/metadata

(New in DSpace 5) Returns all the metadata for the specified item. You can get item identifiers from the ID field returned by the /collections endpoints when one of the options to expand items is used. An alternative to /items/{itemIdentifier}?expand=metadata

/items/{itemIdentifier}/bitstreams

(New in DSpace 5) Returns all the metadata for the bitstreams/files of the specified item. You can get bitstream identifiers from the /items endpoints from the ID field when one of the options to expand bitstreams is used. An alternative to /items/{itemIdentifier}?expand=bitstreams

/bitstreams/{bitstreamIdentifier}

Returns the bitstream/file metadata for the specified identifier. You can get bitstream identifiers from the ID field returned by the /items endpoints when the expand=bitstreams option is used.

/bitstreams/{bitstreamIdentifier}/retrieve

Returns the bitstream/file itself for the specified identifier. You can get bitstream identifiers from the ID field returned by the /items endpoints when the expand=bitstreams option is used.

/bitstreams/{bitstreamIdentifier}/policy

(New in DSpace 5) Returns the policy for the specified bitstream. You can get bitstream identifiers from the ID field returned by the /items endpoints when the expand=bitstreams option is used.
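
The endpoints above can be chained to go from a collection to the files it holds. The following Python sketch is one possible way to do that, not a definitive client. It assumes that the list endpoints return JSON arrays of objects with an "id" key, and it ignores any paging the server may apply. The function names are hypothetical.

```python
import json
from urllib.request import urlopen

BASE_URL = "https://cgspace.cgiar.org/rest"


def fetch_json(path):
    """GET a REST path and decode the JSON body (performs a network request)."""
    with urlopen(BASE_URL + path) as response:
        return json.load(response)


def bitstream_paths(collection_id, fetch=fetch_json):
    """Yield the /retrieve path of every file attached to a collection's items.

    Assumes each list endpoint returns a JSON array of objects with an "id" key.
    """
    for item in fetch("/collections/%s/items" % collection_id):
        for bitstream in fetch("/items/%s/bitstreams" % item["id"]):
            yield "/bitstreams/%s/retrieve" % bitstream["id"]
```

Passing the fetch function as a parameter lets you try the traversal against canned data before pointing it at the live API.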

Requesting XML responses

By default, the API returns data in JSON format. If you would like to see the data as XML, you will need to ask for it in the Accept request header. A simple way to achieve this is using the popular command line program curl.

e.g. Default, returns JSON:

curl -s https://cgspace.cgiar.org/rest/communities/157

e.g. Request JSON:

curl -s -H "Accept: application/json" https://cgspace.cgiar.org/rest/communities/157

e.g. Request XML:

curl -s -H "Accept: application/xml" https://cgspace.cgiar.org/rest/communities/157
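
The same Accept header can be set from a script. Below is a minimal sketch using only the Python standard library; build_request and fetch_text are hypothetical helper names, and fetch_text needs network access when called.

```python
from urllib.request import Request, urlopen

URL = "https://cgspace.cgiar.org/rest/communities/157"


def build_request(url, fmt="json"):
    """Create a GET request whose Accept header asks for JSON or XML."""
    if fmt not in ("json", "xml"):
        raise ValueError("fmt must be 'json' or 'xml'")
    return Request(url, headers={"Accept": "application/" + fmt})


def fetch_text(url, fmt="json"):
    """Send the request and return the body as text (network access needed)."""
    with urlopen(build_request(url, fmt)) as response:
        return response.read().decode("utf-8")


# Inspect the header without sending anything.
request = build_request(URL, "xml")
print(request.get_header("Accept"))
```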

“Handle”: an alternative to IDs

An alternative to using the ID field to navigate through the hierarchy is to use the handle data and the /handle endpoint. This is the same “handle” that you will see in the address bar when browsing the CGSpace content in a browser. A GitHub project shows more detail on using the handle to navigate through the API data. You may take the handle from any of the items to drill down into it: with an item’s handle you can access all of its metadata and the link you need to access the file via the /bitstreams endpoint.

/handle/:handle

Returns the community, collection or item identified by the specified handle. You can get the handle from the address bar when browsing CGSpace, or from the handle data returned by the other endpoints.

For example, view this page in a browser:

You will see the CTA community.

If you add /rest/ to the path and add the expand parameter:

You will see the same information in JSON format

Scroll down through the sub-communities, take the handle of one of their collections, change the expand parameter to “items” and you will see the items held in that collection

Take the handle of one of those items, change the expand parameter to “all” and you will see all of the metadata for that item and a link to any files in the bitstreams information in the retrieveLink field

Accessing a bitstream link will then provide the file
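
The handle walkthrough above can be sketched in Python as follows. The handle 10568/1234 and the sample payload are hypothetical, as are the helper names. The sketch assumes that an expanded item carries a "bitstreams" list whose entries include the retrieveLink field mentioned above.

```python
BASE_URL = "https://cgspace.cgiar.org"


def handle_rest_url(handle, expand=None):
    """Turn a handle such as '10568/1234' into its /rest/handle/ URL."""
    url = "%s/rest/handle/%s" % (BASE_URL, handle.strip("/"))
    if expand:
        url += "?expand=" + expand
    return url


def retrieve_links(item):
    """Collect the retrieveLink of each bitstream in an expanded item."""
    bitstreams = item.get("bitstreams", [])
    return [b["retrieveLink"] for b in bitstreams if "retrieveLink" in b]


# Hypothetical handle and payload, for illustration only.
print(handle_rest_url("10568/1234", expand="all"))
sample_item = {
    "handle": "10568/1234",
    "bitstreams": [{"name": "report.pdf", "retrieveLink": "/bitstreams/99/retrieve"}],
}
print(retrieve_links(sample_item))
```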

Further information is available in the DSpace REST API documentation. The full DSpace documentation may also be of interest

OAI-PMH

The metadata has also been made available using OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting). As in the REST API, this makes use of Dublin Core. The Open Archives Initiative promotes interoperable repositories through which metadata for materials across the Web can be published, shared and archived. An OAI-PMH request can be limited to a datestamp range and can also be restricted to a named set within the repository’s set hierarchy.

The metadata made accessible by the OAI-PMH distinguishes between three entities:

  • Resource – the object that the metadata describes
  • Item – a store that holds or generates metadata about a resource and is identified by a unique identifier. Items can be grouped into sets to allow ‘selective harvesting’; datestamps can likewise be used to harvest only those records that were created, deleted or modified within a specified date range
  • Record – metadata in a single format. A record is returned as an XML-encoded byte stream in response to an OAI-PMH request for metadata from an item
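
Selective harvesting comes down to adding arguments to the request URL. The sketch below builds such URLs in Python using the standard OAI-PMH arguments (verb, metadataPrefix, from, until, set). The oai_url helper is hypothetical and the dates are placeholders, not values taken from CGSpace.

```python
from urllib.parse import urlencode

OAI_BASE = "https://cgspace.cgiar.org/oai/request"


def oai_url(verb, **arguments):
    """Build an OAI-PMH request URL, e.g. for selective harvesting."""
    params = {"verb": verb}
    # "from" is a Python keyword, so callers pass from_ instead.
    for key, value in arguments.items():
        if value is not None:
            params[key.rstrip("_")] = value
    return OAI_BASE + "?" + urlencode(params)


# Records in Dublin Core changed within a placeholder date range.
print(oai_url("ListRecords", metadataPrefix="oai_dc",
              from_="2015-01-01", until="2015-12-31"))
```

A set argument can be passed in the same way to restrict the harvest to one named set.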

To view the CGSpace data using OAI-PMH see https://CGSpace.cgiar.org/oai/request?verb=Identify

The ‘Identify’ verb is used to retrieve information about a repository, such as:

  • Repository: the name of the repository. For example, CGSpace
  • BaseURL: the base URL to which OAI-PMH requests are sent, specifying the host and, optionally, the port. For CGSpace this is the request URL above without the query string. Item identifiers, by contrast, take the form oai:CGSpace.cgiar.org:10568/1234
  • Protocol Version: the version of OAI-PMH supported by the repository
  • Earliest Registered Date: the earliest datestamp in the repository, which sets the lower limit for datestamps used in requests; shown in the form YYYY-MM-DD hh:mm:ss
  • Granularity: the finest datestamp granularity the repository supports, either day (YYYY-MM-DD) or seconds (YYYY-MM-DDThh:mm:ssZ)
  • Deletion Mode: how deleted records are supported, for example ‘No’, ‘Persistent’ or ‘Transient’ (Persistent means the repository keeps information about deletions with no time limit)
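
The fields above correspond to child elements of the Identify response. As a minimal sketch, the Python below parses an abbreviated response with the standard library. The sample values are placeholders rather than CGSpace output, and parse_identify is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Abbreviated, illustrative response; the values are placeholders.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <repositoryName>Example Repository</repositoryName>
    <baseURL>https://example.org/oai/request</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <earliestDatestamp>2000-01-01T00:00:00Z</earliestDatestamp>
    <deletedRecord>transient</deletedRecord>
    <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
  </Identify>
</OAI-PMH>"""


def parse_identify(xml_text):
    """Return the child elements of <Identify> as a name-to-text dict.

    Repeated elements overwrite each other; the last value wins.
    """
    root = ET.fromstring(xml_text)
    identify = root.find("oai:Identify", OAI_NS)
    # Strip the "{namespace}" prefix that ElementTree adds to tag names.
    return {child.tag.split("}", 1)[-1]: (child.text or "").strip()
            for child in identify}


info = parse_identify(SAMPLE)
print(info["repositoryName"], info["granularity"])
```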

From the Identify page, you can also choose to view sets, records and identifiers, as well as detail about the metadata formats used.
