I have been monitoring and evaluating alternative technologies for some time, and a few months ago GBIF initiated the redevelopment of the processing routines. This current area of work does not add functionality to the portal (that will be addressed after this infrastructural work) but rather aims to:
- Reduce the latency between a record changing on the publisher side, and being reflected in the index
- Reduce the number of person-hours needed to see a processing run through to success
- Improve quality assurance by:
  - Checking that terrestrial point locations fall within the stated country, using country shapefiles
  - Checking locations in coastal waters against Exclusive Economic Zone boundaries
- Rework all the date and time handling
- Use dictionaries (vocabularies) for interpretation of fields such as Basis of Record
- Integrate checklists (taxonomic, nomenclatural and thematic) shared through the GBIF ECAT Programme to improve the taxonomic services, and the backbone ("nub") taxonomy.
- Provide a robust framework for future development
- Allow the infrastructure to grow predictably with content and demand growth
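One of the interpretation steps listed above, dictionary-based interpretation of fields such as Basis of Record, can be sketched in a few lines. This is an illustrative sketch only: the vocabulary entries and the function name are invented for the example and are not GBIF's actual dictionaries.

```python
# Hypothetical sketch of vocabulary-based interpretation: map messy
# publisher-supplied "basis of record" values to controlled terms.
# The mapping below is an illustrative subset, not GBIF's real dictionary.
VOCABULARY = {
    "preservedspecimen": "PreservedSpecimen",
    "specimen": "PreservedSpecimen",
    "humanobservation": "HumanObservation",
    "observation": "HumanObservation",
    "fossilspecimen": "FossilSpecimen",
    "livingspecimen": "LivingSpecimen",
}

def interpret_basis_of_record(raw):
    """Normalise a raw value (case, spacing, punctuation) and look it up.

    Returns the controlled vocabulary term, or None if unrecognised.
    """
    if raw is None:
        return None
    key = "".join(ch for ch in raw.lower() if ch.isalnum())
    return VOCABULARY.get(key)
```

For example, `interpret_basis_of_record("Preserved Specimen")` and `interpret_basis_of_record("PRESERVEDSPECIMEN")` both resolve to the same controlled term, while unrecognised input yields `None` so the record can be flagged rather than guessed at.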
The key technologies in use are:

- Apache Hadoop: A distributed file system (HDFS) and cluster processing using the MapReduce framework
  - GBIF are using the Cloudera distribution of Hadoop
- Sqoop: A utility to synchronize data between relational databases and Hadoop
- Hive: A data warehouse infrastructure built on top of Hadoop, developed and open-sourced by Facebook. Hive brings SQL capabilities to Hadoop [full table scans on GBIF occurrence records drop from hours to minutes]
- Oozie: An open-source workflow/coordination service to manage data processing jobs for Hadoop, developed then open-sourced by Yahoo!
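To illustrate the MapReduce model named above, here is a toy sketch of counting occurrence records per country: a map phase emits key-value pairs, the framework groups them by key, and a reduce phase aggregates each group. The record fields and data are invented for illustration; real GBIF jobs run as Hadoop jobs over the full index, not in-process Python.

```python
from collections import defaultdict

def map_phase(records):
    """Emit (country, 1) pairs, playing the role of a Hadoop Mapper."""
    for record in records:
        yield (record["country"], 1)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts per country, playing the role of a Hadoop Reducer."""
    return {key: sum(values) for key, values in grouped.items()}

# Illustrative records only (not real GBIF data)
records = [
    {"country": "DK", "scientificName": "Parus major"},
    {"country": "DK", "scientificName": "Passer domesticus"},
    {"country": "ES", "scientificName": "Lynx pardinus"},
]
counts = reduce_phase(shuffle(map_phase(records)))
# counts -> {"DK": 2, "ES": 1}
```

The same aggregation is what Hive expresses as a one-line `SELECT country, count(*) ... GROUP BY country`, compiled down to MapReduce jobs behind the scenes.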
The processing workflow looks like the following (click for full size):
The Oozie workflow is still under development, but the workflow definition can be found here.
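For readers unfamiliar with Oozie, a workflow is defined as an XML document describing a graph of actions with success and failure transitions. The fragment below is a minimal, hypothetical sketch in the standard Oozie workflow schema; the workflow name, action name, and mapper class are invented and do not reflect the actual GBIF workflow definition linked above.

```xml
<workflow-app name="occurrence-processing" xmlns="uri:oozie:workflow:0.2">
    <start to="parse-records"/>
    <action name="parse-records">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <!-- Hypothetical mapper class for illustration -->
                    <name>mapred.mapper.class</name>
                    <value>org.example.ParseMapper</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Processing failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Each action declares where control flows on success (`ok`) and on failure (`error`), which is what lets Oozie coordinate a multi-step processing run without hand-holding.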