Roadmap 2023#
(Version 1.0, last modified 2023-03-13)
This document provides an overview of the technical roadmap of the Software Heritage initiative for the year 2023.
Live tracking of the roadmap implementation progress during the year is available from a dedicated GitLab board.
Collect#
Add support for write APIs features in GraphQL#
- Lead: jayesh 
- Priority: low 
Description:
Add support for write APIs in GraphQL (e.g., an API for Save Code Now) so that the GraphQL API covers 100% of the REST API features.
Includes work:
- Implement write APIs 
- Enforce authorization configuration for restricted access features 
KPIs:
- GraphQL coverage of 100% of the REST API in production 
Tooling for takedown notices#
- Lead: lunar 
- Priority: high 
Description:
Set up a workflow to handle takedown requests and improve automation capabilities of the sysadmin tools for takedown notices processing.
Includes work:
- Set up a specification for workflow integration in swh-web 
- Implement workflow integration 
- Set up technical specification for sysadmin tooling 
- Implement missing sysadmin tools (verification and automation) 
- Write sysadmin documentation for takedown notices 
KPIs:
- Takedown notice handling integrated into swh-web 
- Automated sysadmin tools for takedown notices processing 
Automate add forge now#
- Lead: vsellier 
- Priority: low 
Description:
Set up automation capabilities for Add forge now to ease the handling of Add forge now requests.
Includes work:
- Automate ingestion process 
- Automate add forge now workflow 
- Set up and deploy automation process in staging 
- Deploy automation process in production 
KPIs:
- Automated Add forge now processing tools and workflow in production 
Minimize archival lag w.r.t. upstream code hosting platforms#
- Lead: olasd 
- Priority: medium 
Description:
Improve ingestion efficiency. Make lag monitoring dashboards easy to find (for decision makers).
Includes work:
- Implement git protocol V2 for Dulwich 
- Optimize scheduling policies 
- Optimize loaders 
KPIs:
- Number of out of date repos (absolute and per platform) 
- Total archive lag (e.g., in days) 
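The two KPIs above could be computed along the following lines. This is a minimal sketch, assuming each origin record carries the time of its last upstream update and the time of its last successful archival visit; the `OriginStatus` class and its fields are hypothetical and do not correspond to an actual Software Heritage data model.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class OriginStatus:
    """Hypothetical record describing one origin (not a real SWH model)."""
    platform: str
    last_upstream_update: datetime
    last_archived: datetime


def lag_kpis(origins):
    """Return (out-of-date count, per-platform counts, total lag in days)."""
    out_of_date = Counter()
    total_lag = timedelta(0)
    for origin in origins:
        lag = origin.last_upstream_update - origin.last_archived
        # An origin is out of date when upstream changed after our last visit.
        if lag > timedelta(0):
            out_of_date[origin.platform] += 1
            total_lag += lag
    total_days = total_lag / timedelta(days=1)
    return sum(out_of_date.values()), dict(out_of_date), total_days
```

Origins archived after their last upstream change contribute nothing to either figure, so the total only measures work that is still pending.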
Extend archive coverage#
- Lead: ardumont 
- Priority: medium 
Description:
Add listers and loaders for not-yet-supported forges, package managers, and VCS. Listers and loaders can be developed in house or contributed by external partners, e.g., via dedicated grants.
Includes work:
- Validate, publicly review, and deploy the listers and loaders pending in staging (Arch, AUR, Crates, Packagist, Rubygems, Fedora, Puppet, Hackage, Golang, Bower, Nix/Guix, CVS, pub.dev) 
- Implement new listers and loaders 
KPIs:
- Number of deployed listers 
- Number of deployed loaders 
Preserve#
Explore possibility of replacing SHA1 with SHA1-DC#
- Lead: olasd 
- Priority: high 
Description:
Mainstream platforms like GitHub now use SHA1-DC (SHA-1 with collision detection).
Includes work:
- Study implications of aligning with the SHA1-DC adoption 
KPIs:
- Decision on whether to move to SHA1-DC, with any blockers identified 
Regularly scrub journal, storage, and objstorage#
- Lead: vlorentz 
- Priority: medium 
Description:
Set up background jobs to regularly check, and repair when necessary, data validity in all SWH data stores. This includes both blobs (swh-objstorage) and other graph objects (swh-storage) on all the copies (in-house, Kafka, Azure, upcoming mirrors, etc.).
Includes work:
- Implement storage scrubber for Cassandra 
- Add scrubbing for the object storage 
- Add metrics and Grafana dashboard for scrubbing process 
- Automatically repair and recover objects found to be invalid 
KPIs:
- List of scrubbers deployed in production 
- Monitoring tools deployed in production 
- Rolling report of operations per datastore including errors found and fixed at each iteration 
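The check-and-repair cycle described above can be illustrated for content-addressed blobs as follows. This is a sketch under stated assumptions, not the swh-scrubber implementation: the store is modeled as a plain dictionary mapping a hex SHA1 identifier to the stored bytes, and `scrub` and `repair` are hypothetical helper names.

```python
import hashlib


def scrub(objects):
    """Yield the ids of objects whose bytes no longer hash to their id.

    `objects` maps a hex SHA1 id to the stored bytes (assumed layout).
    """
    for obj_id, data in objects.items():
        if hashlib.sha1(data).hexdigest() != obj_id:
            yield obj_id


def repair(objects, corrupted_ids, replica):
    """Restore corrupted objects from a replica; return the ids fixed.

    A replica copy is only used if it passes the same integrity check.
    """
    fixed = []
    for obj_id in corrupted_ids:
        candidate = replica.get(obj_id)
        if candidate is not None and hashlib.sha1(candidate).hexdigest() == obj_id:
            objects[obj_id] = candidate
            fixed.append(obj_id)
    return fixed
```

Collect the ids from `scrub` into a list before calling `repair`, since repairing mutates the store being iterated. The lists of corrupted and fixed ids are the raw material for a per-datastore report of errors found and fixed.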
Publicly available standard for SWHID version 1#
- Lead: rdicosmo 
- Priority: high 
Description:
Publish a stable version of the SWHID version 1 specification, approved by a standards body.
Includes work:
- Publish publicly available standard 
- Start ISO standardization for SWHID V1 
KPIs:
- Published standard for SWHID version 1 
SWH Mirror at GRNET#
- Lead: douardda 
- Priority: medium 
Description:
Collaborate with GRNET to create a SWH Mirror
Includes work:
- Guidance and contribution to GRNET architecture and infrastructure choices 
- Specific developments if necessary (to be determined according to the chosen technical solutions) 
- Help with deployment 
KPIs:
- Validated architecture and first POC 
SWH Mirror at Duisburg-Essen university#
- Lead: douardda 
- Priority: low 
Description:
Collaborate with Duisburg-Essen university to create a SWH Mirror
Includes work:
- Guidance and contribution to UniDue architecture and infrastructure choices 
- Specific developments if necessary (to be determined according to the chosen technical solutions) 
- Developments of tools for Winery replication (for Ceph-based object storage) 
- Help with deployment 
KPIs:
- Validated architecture and first POC 
SWH Mirror at ENEA#
- Lead: douardda 
- Priority: high 
Description:
Collaborate with ENEA to create a SWH Mirror
Includes work:
- Finalize object storage copy 
- Configure the stack for the mirror public deployment 
KPIs:
- SWH Mirror deployed on ENEA infrastructure and publicly available 
Mirrors tooling#
- Lead: douardda 
- Priority: high 
Description:
Provide common features required by the SWH mirrors.
Includes work:
- Set up feature flags on the web app and test modules activation/deactivation 
- Implement fallback mechanism for objstorage 
- Dedicated CI for the mirroring stack 
KPIs:
- Common features available for specific mirror instances 
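One possible shape for the objstorage fallback mechanism listed above is a wrapper that reads from the first backend holding the requested object. This is a minimal sketch, assuming backends behave like read-only mappings from object id to bytes; `FallbackStore` is a hypothetical name, not an swh-objstorage class.

```python
class FallbackStore:
    """Read from the first backend that has the object (hypothetical API)."""

    def __init__(self, *backends):
        # Backends are tried in order, e.g. the mirror's local store first,
        # then an upstream store.
        self.backends = backends

    def get(self, obj_id):
        for backend in self.backends:
            try:
                return backend[obj_id]
            except KeyError:
                continue  # not in this backend, try the next one
        raise KeyError(obj_id)
```

With `FallbackStore(local, upstream)`, a mirror whose local copy is still incomplete can keep serving objects it has not replicated yet.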
Archive cold-copy at CINES via Vitam#
- Lead: douardda 
- Priority: medium 
Description:
Perform a first complete copy of the archive, stored in Vitam @ CINES. Keep the copy up to date periodically (period TBD).
Includes work:
- Validate implementation of ORC format in Vitam 
- Run a Proof of Concept 
- Run the complete copy @ CINES 
- Configure/schedule the copy update process 
KPIs:
- First copy stored in Vitam 
- Updates calendar defined 
Support archiving repositories containing SHA1 hash conflicts on blobs#
- Lead: olasd 
- Priority: high 
Description:
Allow multiple hash types to be used for object checksums, to remove the limitations imposed by having SHA1 as the primary key of the object storage internally.
Includes work:
- Implement the remaining low-level layers (model and API are ready) 
KPIs:
- Multiple hash storage facility in production 
- Ability to archive git repos that contain the sample SHAttered collision blobs (they are currently detected and refused) 
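To illustrate why several checksums per object remove the single-key limitation, the sketch below computes a set of hashes for one blob. It is illustrative only and assumes a particular hash set (SHA1, git-style blob SHA1, SHA256, BLAKE2s-256); `multi_hash` is a hypothetical helper, not the production storage code. Two blobs that collide on SHA1 would still differ on SHA256, so a key built from more than one digest can tell them apart.

```python
import hashlib


def multi_hash(data: bytes) -> dict:
    """Compute several checksums for a single blob."""
    # Git hashes a blob as SHA1 over a "blob <length>\0" header plus content.
    git_header = b"blob %d\0" % len(data)
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "blake2s256": hashlib.blake2s(data, digest_size=32).hexdigest(),
    }
```

For the empty blob, `multi_hash(b"")["sha1_git"]` is `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`, the identifier git itself assigns to an empty file.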
Documentation#
Provide a landing page for docs.s.o#
- Lead: lunar 
- Priority: high 
Description:
Provide a user-friendly landing page for all documentation at docs.s.o, providing guidelines for each user type.
Includes work:
- Finalize and publish the landing page content 
- Improve the organization of the left-column menus 
KPIs:
- Landing page in production 
Technical debt#
Set up efficient and consistent swh-storage pagination#
- Lead: jayesh 
- Priority: high 
Description:
Define and implement an efficient structure for pagination in the data sources for swh-storage.
Pagination in the data sources (e.g., storage) is currently neither consistent nor client-friendly. Defining and implementing an efficient structure would address this, and will also involve refactoring some clients.
Includes work:
- Design an efficient pagination architecture 
- Refactor obj-storage to implement the pagination 
- Identify and refactor existing clients that use swh-storage pagination 
KPIs:
- New pagination solution in production for swh-storage 
- Existing clients updated to use this solution 
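One common way to make pagination consistent across endpoints is an opaque page token returned with every page. The sketch below shows that pattern under stated assumptions: `Page`, `list_items`, and `iter_all` are hypothetical names, the data source is an in-memory list, and the token is simply an offset encoded as a string. It is not the design chosen for swh-storage.

```python
from dataclasses import dataclass
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Page(Generic[T]):
    """One page of results plus the token needed to fetch the next one."""
    results: List[T]
    next_page_token: Optional[str] = None


def list_items(items, page_token=None, limit=2):
    """Hypothetical paginated endpoint over an in-memory list."""
    start = int(page_token) if page_token is not None else 0
    end = start + limit
    # A token is returned only when more results remain.
    token = str(end) if end < len(items) else None
    return Page(results=items[start:end], next_page_token=token)


def iter_all(fetch):
    """Client helper: follow page tokens until the source is exhausted."""
    token = None
    while True:
        page = fetch(token)
        yield from page.results
        token = page.next_page_token
        if token is None:
            return
```

A client would call `iter_all(lambda token: list_items(data, token))` and never inspect the token, which leaves the server free to change its encoding later.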
Improve support for malformed git commits#
- Lead: vlorentz 
- Priority: high 
Description:
Improve the git loader to make it able to deal with edge-case commits that cause Dulwich to crash due to unnecessary data validation.
Includes work:
- Fix all crashes of the git loader caused by malformed git objects 
- Support commits whose “author” or “committer” field is missing 
KPIs:
- Ratio of crashes during commit ingestion by the git loader (before/after) 
Tooling and infrastructure#
Dynamic infrastructure#
- Lead: vsellier 
- Priority: high 
Description:
Set up a dynamically scalable infrastructure for Software Heritage services.
Includes work:
- Set up an elastic workers infrastructure 
- Configure Kubernetes clusters 
- Monitoring/Alerting solution for container-based services 
- Ingest the logs of the dynamic components into the current ELK infrastructure 
KPIs:
- Dashboard displaying the status of the dynamic components:
  - Number of listers running
  - Number of loaders running
  - RPC services status
- Logs ingested and correctly parsed in Kibana 
- Clusters fully backed up 
Use a common workflow management tool for swh-web#
- Lead: lunar 
- Priority: medium 
Description:
Find and integrate a common workflow management tool in swh-web for future modules that will require a workflow logic (takedown notices process, user support, etc.)
Includes work:
- Investigate the existing tools, measuring advantages and drawbacks for each 
- Integrate the most relevant tool in swh-web 
- Document the usage with a sample module 
KPIs:
- Integrated workflow tool, ready to use, in swh-web 
Provide a management-friendly monitoring dashboard of services#
- Lead: vsellier 
- Priority: high 
Description:
Provide a high-level and easy to find dashboard of running services with documented key indicators.
Includes work:
- Gather public site metrics 
- Publish and document a dedicated dashboard 
- Add links to it on common web applications (web app and docs.s.o) 
KPIs:
- Indicators available for public sites status 
- Indicators for archive workers status 
- Indicators for archive behavior 
- Main dashboard that aggregates the indicators 
- Dashboard referenced in common web applications 
Provenance in production#
- Lead: douardda 
- Priority: high 
Description:
Publish swh-provenance services in production, including revision and origin layers.
Includes work:
- Build and deploy content index based on a winnowing algorithm 
- Filter provenance pipeline to process only tags and releases 
- Set up a production infrastructure for the Kafka-based revision layer (including monitoring) 
- Refactor and process the origin layer 
- Release provenance documentation 
KPIs:
- Provenance services available in production 
- % of archive covered 
Scale-out objstorage in production as primary objstorage#
- Lead: olasd 
- Priority: high 
Description:
Have the Ceph-based objstorage for SWH (Winery) in production as primary storage, and set up an equivalent MVP in staging (possibly using the same Ceph cluster).
Includes work:
- Deploy Ceph objstorage/Winery on CEA infrastructure 
- Benchmark Ceph-based objstorage 
- Switch to Ceph-based objstorage as primary storage 
- Handle Mirroring 
KPIs:
- Ceph-based obj-storage in production 
Cassandra in production as primary storage#
- Lead: vsellier 
- Priority: high 
Description:
Use Cassandra as primary storage in production, as a replacement for PostgreSQL.
Includes work:
- Finalize and validate the replayed data 
- Install the new bare metal servers for staging and production 
- Deploy a Cassandra-based production instance for tests 
- Benchmark the Cassandra infrastructure 
- Switch to Cassandra in production for primary storage 
KPIs:
- Replayed data validated 
- Live staging archive instance in parallel with the legacy PostgreSQL instance 
- Live production archive instance in parallel with the legacy PostgreSQL instance 
- Cassandra primary storage in staging 
- Cassandra primary storage in production 
Design and test a Continuous Deployment infrastructure#
- Lead: vsellier 
- Priority: medium 
Description:
Set up a Continuous Deployment infrastructure to improve bug detection and validate the future elastic infrastructure components.
Includes work:
- Migrate away from Debian packaging for deployment (to PyPI packages?) 
- Build a Docker image per deployable service 
- Build the deployment tooling 
- Reset and redeploy the stack after commits 
- Execute acceptance tests 
- Identify whether a deployment can be done by the CI or needs human interaction (mostly by detecting whether a migration is present) 
- Integration tests 
KPIs:
- Docker image build triggered by a new version published on PyPI 
- Docker image built by the CI 
- Component versions updated by the CI 
- Automatically redeployed staging on new release 
- Testing in staging (or another pre-production environment) before pushing to production 
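The work item about deciding whether the CI can deploy on its own could start from a check on the files changed by a release. This is a rough sketch, assuming migrations can be recognized from their path; the `can_auto_deploy` function and the `migrations/` marker are hypothetical and would need to match the actual repository layout.

```python
def can_auto_deploy(changed_files, migration_markers=("migrations/",)):
    """Return True when no changed path looks like a schema migration.

    `migration_markers` lists path fragments that identify a migration
    (assumed convention, to be adapted per repository).
    """
    return not any(
        marker in path
        for path in changed_files
        for marker in migration_markers
    )
```

A release for which this returns `False` would be held for human review rather than deployed automatically.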
Design and test next generation CI Automation#
- Lead: olasd 
- Priority: low 
Description:
Design and test solutions to improve the current Continuous Integration tools, so that they match the infrastructure evolutions and provide more features.
Includes work:
- State of the art of the current CI and requirements specification 
- Evaluation of a migration from Jenkins to GitLab CI (and effective migration if relevant) 
- Code audit tools integration (static and/or dynamic analysis) 
KPIs:
- GitLab CI used or tested in one or more sysadmin projects 
- Evaluation matrix (pros/cons) for a migration from Jenkins to GitLab CI or another tool 
- Pros/cons of deploying a code audit tool 
Graph export and graph compression in production#
- Lead: vlorentz 
- Priority: high 
Description:
Have the graph compression pipeline running in production with less than a month of lag. This covers deployment, hosting, and pipeline tooling.
Includes work:
- Add JVM monitoring 
- Finish automation scripts 
- Deploy on a dedicated machine 
KPIs:
- Graph compression pipeline in production 
- Last update date / number of updates per year