Skip to contents

Introduction

The integrated Rule Oriented Data System (iRODS) is an open source data management software. It is used by research organisations and governmental institutes and has an important role in their ever growing data services.

This package provides an R user interface with iRODS and implements the basic functionality of working with iRODS collections and data objects from R. Whith this package it is possible to create, modify, and remove collections and data objects from an iRODS Zone. For data objects it is also possible to add, remove and modify meta data in the so called attribute-value-unit (AVU) tripples. For more information see the iRODS documentation

This package is based on the python irodsclient and it uses reticulate to interact with this library.

All functions in this package start with the ri_ prefix, this prevents hiding functions from other packages and makes your code more readable.

prerequisites

This package requires that the python-irodsclient library is installed. It is developped and tested using python version >3. Use pip to install the library:

pip3 install python-irodsclient

This package uses the iRODS environment files in ~/.irods/, these files should contain the user credentials needed to connect to iRODS. So it is up to the user to set up their iRODS environment before he or she can use this package. If the user can use the so called ‘i commands’, this package should also function properly.

Zones, collections, data objects and meta data

To understand how to use this package a basic understanding of iRODS is needed. The short version to remember is: each iRODS system is a ‘zone’, in these zones data collections exists and each data collection contains many data objects. Each data object contains meta data in so called Attribute-Value-Unit (AVU) tripples. There is a longer version but that’s beyond the scope of this vignette, please see the iRODS documentation.

Zones

A Zone is an independent iRODS system which contains the data catalog and one or more iRODS servers. While an iRODS implementation can contain multiple zones, in this vignette we assume a single zone implementation.

Collections

A Zone can contain on or more collections and each collection can contain one more sub collections. Collections are a way to store data in a hierarchical structure like directories or folders.

This package uses a default collection, which you must set manually. If you manipulate (store/retrieve etc) your data objects, it is assumed that they exists in your default collection, unless you provide a specific collection in the function arguments.

Data objects

A Collection can contain multiple data objects, like files in a folder or directory. Although somewhat different, a data object can be considered as a normal file.

Meta data

Meta data is stored in AVU tripples. Many objects in iRODS, like users, collections, or resources can contain meta data but this package only considers meta data associated with data objects.

AVU triples must contain an attribute and value, unit is optional. Within iRODS it is possible to search and query data objects based on their values in the AVU triples. These search options are not yet implemented in this package.

Initialise an iRODS session

So, now that we know the basics let’s start with an iRODS session. We assume that your iRODS environment is set up properly and that you have the python iRODS client in place.

To start an iRODS session from within R you exexcute the following command. The irods_env_file usually points to ~/.irods/irods_environment.json.

library(ricmd)
ri_session(irods_env_file)

Executing this command creates a hidden environment wich contains the iRODS configuration and the session object. This session object refers to the session class from the pyton library and contains all the methods for working with iRODS.

After initiating the iRODS session one must set the default collection. In our case we use the home collection of the user and we asume that the variables irods_zone and irods_user contain the name of the iRODS zone and user respectively.

my_collection <- file.path("",irods_zone, "home", irods_user)
my_collection
#> [1] "/tempZone/home/rods"
ri_setCollection(my_collection)

The name of the collection always start with the name of the zone, in our case ‘/tempZone’ (mind the ‘/’ at the beginning). We use the standard collection from a default iRODS installation with a username ‘devel’, hence ‘home/devel’.

This default collection can be changed anytime in your script. If you want check which collection is the default collection, then use:

ri_getCollection()
#> [1] "/tempZone/home/rods"

Creating a new collection

A new data collection can be created with

test_collection <- file.path(my_collection, "tests")
ri_createCollection(test_collection)
#> Error in ri_createCollection(test_collection): ri_createCollection: collection allready exists

The only argument of ri_createCollection is the full path to the new collection. If this collection allready exists, an error will follow.

Since we created the new collection inside our default collection, we can show it using the ri_listSubCollections command:

ri_listSubCollections()
#> [1] "tests"

To remove a collection, the ri_removeCollection command can be used:

ri_removeCollection(test_collection)

Creating data objects

Data objects vs local files

Data objects in iRODS can be regarded as files (although, technically not completely true). One can store a local file from disk as data object into a data collection of iRODS. Also, a data object can be retrieved from iRODS and stored as file on the local disk. Storing and retrieving in iRODS are called put and get respectively.

By default the data object is stored in or retrieved from the default collection set by ri\_setCollection. An equivalent approach can be followed for files stored on disk. It is good practice to seperate data from code.This means that data is always written to a different location (directory) than your current working directory, which contains the code. Using the ri\_setDatadir function one can set the location where files are stored on the local file system and the ricmd package uses that directory by default.

To demonstrate this, we first create a temporary data directory and store a data file in that directory:


testDatadir <- file.path(tempdir(), 'data')
dir.create(testDatadir)

x <- rnorm(10)
saveRDS(x,file.path(testDatadir, "datafile.rds"))

We set this data directory as default location for local files

ri_setDatadir(testDatadir)

Store data objects

For storing data objects into iRODS you can use the ri_put command. In the next example we use the test collection created above as default collection.

ri_setCollection(test_collection)
ri_put(filename = file.path(testDatadir, "datafile.rds"))
#> Error in ri_put(filename = file.path(testDatadir, "datafile.rds")): ri_put: object allready exists and overwrite is FALSE

This stores the datafile.rds file as object in the default collection. To see a list of stored objects the ri_listDataObjects function is available

ri_listDataObjects()
#> [1] "datafile.rds"

Overwriting and existing object is not allowed, this results in an error. Using the overwrite=TRUE argument, overwriting of objects can be forced

# results in an error
try(ri_put(filename = file.path(testDatadir, "datafile.rds")))
#> Error in ri_put(filename = file.path(testDatadir, "datafile.rds")) : 
#>   ri_put: object allready exists and overwrite is FALSE

# this works
ri_put(filename = file.path(testDatadir, "datafile.rds"), overwrite = TRUE)

Now that the data is stored in iRODS you can remove the local file if you want.

unlink(file.path(testDatadir, "datafile.rds"))

Retreive data objects

To get (retrieve) the data object created in the last section, you use the ri_get() function. By default, this functions stores the file locally in the data directory defined above.

ri_get("datafile.rds")

And now you can read and process the data

x <- readRDS(file.path(testDatadir, "datafile.rds"))
print(length(x))
#> [1] 10

I you want to store the file in an other location, or using a different filename, you can use the datadir and filename arguments of this function

ri_get("datafile.rds", datadir = "/path/to/other/directory",
    filename = "differentFileName.rds")

The ri_get does not overwrite existing files by default. If you want to get the same data object more than once you either overwrite the existing file or rename it

# the file allready exists, so this results in an error
try(ri_get("datafile.rds"))
#> Error in ri_get("datafile.rds") : 
#>   ri_get: file allready exists in data directory and overwrite=FALSE

# overwrite te file
ri_get("datafile.rds", overwrite = TRUE)

# or use different file anme
ri_get("datafile.rds", filename = "datafile2.rds")

Manage meta data

In iRODS meta data, an AVU tripple, can be associated with users, collections and data objects. This package only supports meta data of data objects. Meta data of collections, users etc. is not supported.

An AVU triple must be an unique combination of attribute, value, and unit. If two tripples have the same attribute, the value must differ, or, if the tripples have the same attribute and value, then the unit must differ.

To add meta data to an object, you use the ri_metaAdd function. If we want to add meta data to the datafile.rds data object created above, we execute:

ri_metaAdd(object = "datafile.rds", attribute = "attr1", value = "val1")

In above command we assume that datafile.rds is in the default collection. For an AVU triple, you must provide an attribute and value. The unit is optional and defaults to NULL.

If you want to add meta data to an object in a different collection, you must provide the collection argument:

ri_metaAdd(object = "otherObject", attribute = "attr1", value = "val4",
           collection = "/zone/otherCollection")

With the ri_metaGet you can get all the meta data from an object as a data.frame with attribute, value, and units fields. The data.frame contains a row for each attribute existing in the meta data. The data.frame gives NA values for units which are NULL.

# first add some more meta data
ri_metaAdd(object = "datafile.rds", attribute = "attr2", value = "val2")
ri_metaAdd(object = "datafile.rds", attribute = "attr3", value = "val3", units = "unit1")

# get meta data
meta <- ri_metaGet("datafile.rds")
print(meta)
#>   attribute value units
#> 1     attr1  val1  <NA>
#> 2     attr2  val2  <NA>
#> 3     attr3  val3 unit1

You can also delete meta triples from an object.


ri_metaRemove("datafile.rds", attribute  = "attr3", value = "val3", units = "unit1")
#> NULL
meta <- ri_metaGet("datafile.rds")
print(meta)
#>   attribute value units
#> 1     attr1  val1  <NA>
#> 2     attr2  val2  <NA>