Extrinsic metadata specification#
Extrinsic metadata is information about software that is not part of the source code itself but still closely related to the software. Typical sources for extrinsic metadata are: the hosting place of a repository, which can offer metadata via its web view or API; external registries like collaborative curation initiatives; and out-of-band information available at source code archival time.
Since they are not part of the source code, a dedicated mechanism to fetch and store them is needed.
This specification assumes the reader is familiar with Software Heritage’s Software Architecture and Data model.
Metadata sources#
Metadata fetchers#
Metadata fetchers are software components used to fetch metadata from a metadata authority, and ingest them into the Software Heritage archive.
A metadata fetcher is uniquely defined by these properties:
- its type 
- its version 
Examples:
- loaders, which may either discover metadata as a side-effect of loading source code, or be dedicated to fetching metadata. 
- listers, which may discover metadata as a side-effect of discovering origins. 
- deposit submitters, which push metadata to SWH from a third-party; usually at the same time as a software artifact 
- crawlers, which fetch metadata from an authority in a way that is none of the above (eg. by querying a specific API of the origin’s forge). 
Storage API#
Artifact metadata#
Data model#
The storage database stores metadata on origins, and all software artifacts supported by the data model. They are represented using this structure (simplified Python code):
class RawExtrinsicMetadata(HashableObject, BaseModel):
    object_type = "raw_extrinsic_metadata"
    # target object
    target: ExtendedSWHID
    # source
    discovery_date: datetime.datetime
    authority: MetadataAuthority
    fetcher: MetadataFetcher
    # the metadata itself
    format: str
    metadata: bytes
    # context
    origin: Optional[str] = None
    visit: Optional[int] = None
    snapshot: Optional[CoreSWHID] = None
    release: Optional[CoreSWHID] = None
    revision: Optional[CoreSWHID] = None
    path: Optional[bytes] = None
    directory: Optional[CoreSWHID] = None
    id: Sha1Git
The target may be:
- a regular core SWHID, 
- a SWHID-like string with type - oriand the SHA1 of an origin URL
- a SWHID-like string with type - emdand the SHA1 of an other- RawExtrinsicMetadataobject (to represent metadata on metadata objects)
id is a sha1 hash of the RawExtrinsicMetadata object itself;
it may be used in other RawExtrinsicMetadata as target.
discovery_date is a Python datetime.
authority must be a dict containing keys type and url, and
fetcher a dict containing keys name and version.
The authority and fetcher must be known to the storage before using this
endpoint.
format is a text field indicating the format of the content of the
metadata byte string, see extrinsic-metadata-formats.
metadata is a byte array.
Its format is specific to each authority; and is treated as an opaque value
by the storage.
Unifying these various formats into a common language is outside the scope
of this specification.
Finally, the remaining fields allow metadata can be given on a specific artifact within a specified context (for example: a directory in a specific revision from a specific visit on a specific origin) which will be stored along the metadata itself.
For example, two origins may develop the same file independently; the information about authorship, licensing or even description may vary about the same artifact in a different context. This is why it is important to qualify the metadata with the complete context for which it is intended, if any.
The allowed context fields for each target type are:
- for - emd(extrinsic metadata) and- ori(origin): none
- for - snp(snapshot):- origin(a URL) and- visit(an integer)
- for - rel(release): those above, plus- snapshot(the core SWHID of a snapshot)
- for - rev(revision): all those above, plus- release(the core SWHID of a release)
- for - dir(directory): all those above, plus- revision(the core SWHID of a revision) and- path(a byte string), representing the path to this directory from the root of the- revision
- for - cnt(content): all those above, plus- directory(the core SWHID of a directory)
All keys are optional, but should be provided whenever possible. The dictionary may be empty, if metadata is fully independent from context.
In all cases, visit should only be provided if origin is
(as visit ids are only unique with respect to an origin).
Storage API#
The storage API offers three endpoints to manipulate origin metadata:
- Adding metadata: - raw_extrinsic_metadata_add(metadata: List[RawExtrinsicMetadata]) - which adds a list of - RawExtrinsicMetadataobjects, whose- metadatafield is a byte string obtained from a given authority and associated to the- target.
- Getting all metadata: - raw_extrinsic_metadata_get( target: ExtendedSWHID, authority: MetadataAuthority, after: Optional[datetime.datetime] = None, page_token: Optional[bytes] = None, limit: int = 1000, ) -> PagedResult[RawExtrinsicMetadata]: - returns a list of - RawExtrinsicMetadatawith the given- targetand from the given- authority. If- afteris provided, only objects whose discovery date is more recent are returnered.- PagedResultis a structure containing the results themselves, and a- next_page_tokenused to fetch the next list of results, if any
Extrinsic metadata formats#
Formats are identified by an opaque string. When possible, it should be the MIME type already in use to describe the metadata format outside Software Heritage. Otherwise it should be unambiguous, printable ASCII without spaces, and human-readable.
Here is a list of all the metadata format stored:
- application/vnd.github.v3+json
- The metadata is the response of an API call to GitHub. 
- cpan-release-json
- The metadata is the response of an API call to CPAN’s /release/{author}/{release} endpoint. 
- gitlab-project-json
- The metadata is the response of an API call to Gitlab’s /api/v4/projects/:id endpoint. 
- gitea-repository-json
- The metadata is the response of an API call to Gitea’s /api/v1/repos/{owner}/{repo} endpoint. 
- gogs-repository-json
- The metadata is the response of an API call to Gogs’s /api/v1/repos/:owner/:repo endpoint. 
- pypi-project-json
- The metadata is a release entry from a PyPI project’s JSON file, extracted and re-serialized. 
- replicate-npm-package-json
- ditto, but from a replicate.npmjs.com project 
- nixguix-sources-json
- ditto, but from https://nix-community.github.io/nixpkgs-swh/ 
- original-artifacts-json
- tarball data, see below 
- sword-v2-atom-codemeta
- XML Atom document, with Codemeta metadata, as sent by a deposit client, see the Deposit protocol reference. 
- sword-v2-atom-codemeta-v2
- Deprecated alias of - sword-v2-atom-codemeta
- sword-v2-atom-codemeta-v2-in-json
- Deprecated, JSON serialization of a - sword-v2-atom-codemetadocument.
- xml-deposit-info
- Information about a deposit, to identify the provenance of a metadata object sent via swh-deposit, see below 
Details on some of these formats:
original-artifacts-json#
This is a loosely defined format, originally used as a metadata column
on the revision table that changed over the years.
It is a JSON array, and each entry is a JSON object representing an archive (tarball, zipball, …) that was unpackaged by the SWH loader before loading its content in Software Heritage.
When writing this specification, it was stabilized to this format:
[
   {
      "length": <int>,
      "filename": "<original filename>",
      "checksums": {
          "sha1": "<hex-encoded string>",
          "sha256": "<hex-encoded string>",
      },
      "url": "<URL the archive was downloaded from>"
   },
   ...
]
Older original-artifacts-json were migrated to use this format,
but may be missing some of the keys.
xml-deposit-info#
Deposits with code objects are loaded as their own origin, so we can look them up in the deposit database from their metadata (which hold the origin as a context).
This is not true for metadata-only deposits, because we don’t create an origin for them; so we need to store this information somewhere. The naive solution would be to insert them in the Atom entry provided by the client, but it means altering a document before we archive it, which potentially corrupts it or loses part of the data.
Therefore, on each metadata-only deposit, the deposit creates an extra “metametadata” object, with the original metadata object as target, and using this format:
<deposit xmlns="https://www.softwareheritage.org/schema/2018/deposit">
    <deposit_id>{{ deposit.id }}</deposit_id>
    <deposit_client>{{ deposit.client.provider_url }}</deposit_client>
    <deposit_collection>{{ deposit.collection.name }}</deposit_collection>
</deposit>