Software Heritage - Datastore Scrubber#
Tools to periodically check data integrity in swh-storage, swh-objstorage
and swh-journal, report errors, and (try to) fix them.
The Scrubber package is made of the following parts:
Checking#
Highly parallel processes continuously read objects from a data store, compute their checksums, and record any failure in a database, along with the data of the corrupt object.
There is one “checker” for each datastore package: storage (PostgreSQL and Cassandra), journal (Kafka), and object storage (any backend).
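As an illustration of the checksum step, here is a minimal Python sketch. It assumes the object is a content blob identified by its git-style SHA-1; the function names are hypothetical and this is not the scrubber's actual code.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    # Git hashes a blob as SHA-1 over the header "blob <size>", a NUL byte,
    # and then the raw data.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def is_corrupt(expected_hex: str, data: bytes) -> bool:
    # Hypothetical helper: an object is treated as corrupt when the
    # recomputed hash differs from the identifier it is stored under.
    return git_blob_sha1(data) != expected_hex.lower()
```

A checker built along these lines would call is_corrupt for each object it reads and record the failing ones.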
The journal is “crawled” using its native streaming; others are crawled by range,
reusing swh-storage’s backfiller utilities, and checkpointed from time to time
to the scrubber’s database (in the checked_range table).
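The range-based crawl with checkpointing can be pictured with the following sketch. It assumes 20-byte object ids split into equal contiguous ranges, and uses an in-memory set as a stand-in for the checked_range table; the names partition_bounds and run_checker are hypothetical, and the real partitioning used by swh-storage may differ.

```python
from typing import Callable, Set, Tuple

ID_BYTES = 20  # assumption: object ids are 20-byte SHA-1 digests


def partition_bounds(partition_id: int, nb_partitions: int) -> Tuple[bytes, bytes]:
    # Split the id space [0, 2**160) into nb_partitions contiguous ranges and
    # return the inclusive (first_id, last_id) pair of the requested range.
    if not 0 <= partition_id < nb_partitions:
        raise ValueError("partition_id out of range")
    space = 1 << (8 * ID_BYTES)
    start = partition_id * space // nb_partitions
    end = (partition_id + 1) * space // nb_partitions - 1
    return start.to_bytes(ID_BYTES, "big"), end.to_bytes(ID_BYTES, "big")


def run_checker(
    nb_partitions: int,
    check_range: Callable[[bytes, bytes], None],
    checked: Set[int],
) -> None:
    # Visit every partition that has not been checkpointed yet, check it,
    # then record it so that a restarted worker can skip it.
    for partition_id in range(nb_partitions):
        if partition_id in checked:
            continue
        first, last = partition_bounds(partition_id, nb_partitions)
        check_range(first, last)
        checked.add(partition_id)
```

Because progress is recorded per partition, an interrupted worker only repeats the partition it was processing when it stopped.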
Storage#
For the storage checker, a checking configuration must be created before any checkers can be spawned.
A new configuration is created using the swh scrubber check init tool:
$ swh scrubber check init storage --object-type snapshot --nb-partitions 65536 --name chk-snp
Created configuration chk-snp [2] for checking snapshot in datastore storage postgresql
Note
A configuration file is expected, as for most swh tools.
This file must have a scrubber section with the configuration of
the scrubber database. For storage checking operations, this
configuration file must also have a storage configuration section.
See the swh-storage documentation for more details on this. A
typical configuration file could look like:
scrubber:
  cls: postgresql
  db: postgresql://localhost/postgres?host=/tmp/tmpk9b4wkb5&port=9824
storage:
  cls: postgresql
  db: service=swh
  objstorage:
    cls: noop
Note
The configuration section scrubber_db was renamed to
scrubber in swh-scrubber version 2.0.0.
One or more checking workers can then be spawned using the swh scrubber
check run command:
$ swh scrubber check run chk-snp
[...]
Object storage#
As with the storage checker, a checking configuration must be created before any checkers can be spawned.
A new configuration is created using the swh scrubber check init tool:
$ swh scrubber check init objstorage --object-type content --nb-partitions 65536 --name check-contents
Created configuration check-contents [3] for checking content in datastore objstorage remote
Note
A configuration file is expected, as for most swh tools.
This file must have a scrubber section with the configuration of
the scrubber database. For object storage checking operations, this
configuration file must have:
- a storage configuration section if content ids are read from it (default)
- a journal configuration section if content ids are read from a kafka content topic (this requires the --use-journal flag of the swh scrubber check run command)
- an objstorage configuration section targeting the object storage to check
See the swh-storage documentation, swh-objstorage documentation and swh-journal documentation for more details on this. A typical configuration file could look like:
scrubber:
  cls: postgresql
  db: postgresql://localhost/postgres?host=/tmp/tmpk9b4wkb5&port=9824
storage:
  cls: postgresql
  db: service=swh
  objstorage:
    cls: noop
journal:
  cls: kafka
  brokers:
    - broker1.journal.softwareheritage.org:9093
    - broker2.journal.softwareheritage.org:9093
    - broker3.journal.softwareheritage.org:9093
    - broker4.journal.softwareheritage.org:9093
  group_id: swh.scrubber
  prefix: swh.journal.objects
  on_eof: stop
objstorage:
  cls: remote
  url: https://objstorage.softwareheritage.org/
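The section requirements described in these notes can be summarised in a small Python sketch. The helper missing_sections is hypothetical and not part of swh-scrubber; it only encodes the rules stated here and takes the configuration as an already-loaded dictionary.

```python
from typing import Dict, List


def missing_sections(
    config: Dict[str, dict], datastore: str, use_journal: bool = False
) -> List[str]:
    # The scrubber database section is always required.
    required = ["scrubber"]
    if datastore == "storage":
        required.append("storage")
    elif datastore == "journal":
        required.append("journal")
    elif datastore == "objstorage":
        required.append("objstorage")
        # Content ids come from the storage by default, or from a kafka
        # content topic when the --use-journal flag is given.
        required.append("journal" if use_journal else "storage")
    else:
        raise ValueError(f"unknown datastore: {datastore}")
    return [name for name in required if name not in config]
```

For example, a configuration holding only scrubber, storage and objstorage sections is complete for the default object storage check, but lacks the journal section needed when --use-journal is used.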
By default, an object storage checker detects missing and corrupted contents.
To disable detection of missing contents, use the --no-check-references
option of the swh scrubber check init command.
To disable detection of corrupted contents, use the --no-check-hashes
option of the swh scrubber check init command.
One or more checking workers can then be spawned using the swh scrubber
check run command:
- if the content ids must be read from a storage instance:
$ swh scrubber check run check-contents
[...]
- if the content ids must be read from a kafka content topic of swh-journal:
$ swh scrubber check run check-contents --use-journal
[...]
Journal#
As with the other checkers, a checking configuration must be created before any checkers can be spawned.
A new configuration is created using the swh scrubber check init tool:
$ swh scrubber check init journal --object-type directory --name check-dirs-journal
Created configuration check-dirs-journal [4] for checking directory in datastore journal kafka
Note
A configuration file is expected, as for most swh tools.
This file must have a scrubber section with the configuration of
the scrubber database. For journal checking operations, this
configuration file must also have a journal configuration section.
See the swh-journal documentation for more details on this. A typical configuration file could look like:
scrubber:
  cls: postgresql
  db: postgresql://localhost/postgres?host=/tmp/tmpk9b4wkb5&port=9824
journal:
  cls: kafka
  brokers:
    - broker1.journal.softwareheritage.org:9093
    - broker2.journal.softwareheritage.org:9093
    - broker3.journal.softwareheritage.org:9093
    - broker4.journal.softwareheritage.org:9093
  group_id: swh.scrubber
  prefix: swh.journal.objects
  on_eof: stop
One or more checking workers can then be spawned using the swh scrubber
check run command:
$ swh scrubber check run check-dirs-journal
[...]
Recovery#
Periodically, jobs go through the list of known corrupt objects and try to recover the original objects by various means:
- Brute-forcing variations until they match their checksum 
- Recovering from another data store 
- As a last resort, recovering from known origins, if any 
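The first strategy, brute-forcing variations, can be illustrated with a minimal sketch. It assumes the corruption is a single flipped bit and that the checksum is a plain SHA-1 of the data; the function name is hypothetical, and the variations actually tried by the scrubber are not specified here.

```python
import hashlib
from typing import Optional


def recover_by_bit_flip(corrupt: bytes, expected_sha1_hex: str) -> Optional[bytes]:
    # Try every single-bit variation of the corrupt data and return the first
    # one whose SHA-1 matches the expected checksum, or None if none does.
    buf = bytearray(corrupt)
    for index in range(len(buf)):
        for bit in range(8):
            buf[index] ^= 1 << bit  # flip one bit
            if hashlib.sha1(buf).hexdigest() == expected_sha1_hex:
                return bytes(buf)
            buf[index] ^= 1 << bit  # restore it before trying the next one
    return None
```

The search costs one hash computation per bit of the object, so it is only practical for small objects or for narrow classes of corruption.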
Reinjection#
Finally, when an original object is recovered, it is reinjected into the original data store, replacing the corrupt one.