Relational schema
The Merkle DAG of the Software Heritage archive is encoded in the dataset as a set of relational tables.
This page documents the relational schema of the latest version of the graph dataset.
Note: To limit abuse, some columns containing personal information are pseudonymized in the dataset using a hash algorithm. Individual authors may be retrieved by querying the Software Heritage API.
content: contains information on the contents stored in the archive.
sha1 (string): the SHA-1 of the content (hexadecimal)
sha1_git (string): the Git SHA-1 of the content (hexadecimal)
sha256 (string): the SHA-256 of the content (hexadecimal)
blake2s256 (bytes): the BLAKE2s-256 of the content (hexadecimal)
length (integer): the length of the content
status (string): the visibility status of the content
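The four checksum columns can be reproduced from the raw bytes of a file. The sketch below uses only Python's `hashlib`; the function name `content_hashes` is hypothetical and not part of any Software Heritage tool. It relies on the standard Git blob convention, in which the SHA-1 is computed over a `blob <length>\0` header followed by the data.

```python
import hashlib


def content_hashes(data: bytes) -> dict:
    """Compute the identifiers listed in the content table for raw bytes."""
    # Git hashes a blob as b"blob <length>\0" followed by the raw bytes.
    git_header = b"blob %d\x00" % len(data)
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        # hashlib.blake2s defaults to a 32-byte (256-bit) digest.
        "blake2s256": hashlib.blake2s(data).hexdigest(),
        "length": len(data),
    }


row = content_hashes(b"hello world\n")
print(row["sha1_git"])
```

The `sha1_git` value should match what `git hash-object` reports for the same file, which makes it a convenient key for looking up a local file in the `content` table.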
skipped_content: contains information on the contents that were not archived for various reasons.
sha1 (string): the SHA-1 of the skipped content (hexadecimal)
sha1_git (string): the Git SHA-1 of the skipped content (hexadecimal)
sha256 (string): the SHA-256 of the skipped content (hexadecimal)
blake2s256 (bytes): the BLAKE2s-256 of the skipped content (hexadecimal)
length (integer): the length of the skipped content
status (string): the visibility status of the skipped content
reason (string): the reason why the content was skipped
directory: contains the directories stored in the archive.
id (string): the intrinsic hash of the directory (hexadecimal), recursively computed with the Git SHA-1 algorithm
directory_entry: contains the entries in directories.
directory_id (string): the Git SHA-1 of the directory containing the entry (hexadecimal).
name (bytes): the name of the file (basename of its path)
type (string): the type of object the entry points to (either rev (revision), dir (directory) or file (content)).
target (string): the Git SHA-1 of the object this entry points to (hexadecimal).
perms (integer): the permissions of the object
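Because each entry of type dir points to another row set keyed by directory_id, a full file listing is obtained by walking the table recursively. The following is a minimal sketch over an in-memory SQLite table with made-up identifiers (`d_root`, `c_main`, and so on); the function name `list_files` and the toy rows are assumptions for illustration, and real identifiers are 40-character hexadecimal hashes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE directory_entry "
    "(directory_id TEXT, name BLOB, type TEXT, target TEXT, perms INTEGER)"
)
# Toy rows: a root directory with one file and one subdirectory.
conn.executemany(
    "INSERT INTO directory_entry VALUES (?, ?, ?, ?, ?)",
    [
        ("d_root", b"README", "file", "c_readme", 0o100644),
        ("d_root", b"src", "dir", "d_src", 0o040000),
        ("d_src", b"main.c", "file", "c_main", 0o100644),
    ],
)


def list_files(conn, root_id):
    """Return sorted (path, content id) pairs reachable from a directory."""
    files = []
    stack = [(root_id, b"")]
    while stack:
        dir_id, prefix = stack.pop()
        rows = conn.execute(
            "SELECT name, type, target FROM directory_entry "
            "WHERE directory_id = ?",
            (dir_id,),
        ).fetchall()
        for name, entry_type, target in rows:
            path = prefix + name
            if entry_type == "dir":
                stack.append((target, path + b"/"))
            elif entry_type == "file":
                files.append((path, target))
            # Entries of type "rev" point to a revision and are skipped here.
    return sorted(files)


print(list_files(conn, "d_root"))
```

Paths are kept as bytes because the name column is of type bytes and may not be valid UTF-8.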
revision: contains the revisions stored in the archive.
id (string): the intrinsic hash of the revision (hexadecimal), recursively computed with the Git SHA-1 algorithm. For Git repositories, this corresponds to the commit hash.
message (bytes): the revision message
author (string): an anonymized hash of the author of the revision.
date (timestamp): the date the revision was authored
date_offset (integer): the offset of the timezone of date
committer (string): an anonymized hash of the committer of the revision.
committer_date (timestamp): the date the revision was committed
committer_offset (integer): the offset of the timezone of committer_date, known as committer_date_offset in swh-storage
directory (string): the Git SHA-1 of the directory the revision points to (hexadecimal). Every revision points to the root directory of the project source tree to which it corresponds.
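The date and date_offset columns together allow the author's local wall-clock time to be recovered. The sketch below assumes that date is stored in UTC and that date_offset is expressed in minutes; neither assumption is stated on this page, so check both against the dataset before relying on them. The function name `local_author_time` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def local_author_time(date_utc: datetime, date_offset: int) -> datetime:
    """Shift a UTC timestamp into the author's timezone.

    Assumes date_offset is a signed number of minutes east of UTC.
    """
    author_tz = timezone(timedelta(minutes=date_offset))
    return date_utc.astimezone(author_tz)


utc = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
local = local_author_time(utc, 120)
print(local.isoformat())
```

The same function applies to committer_date and committer_offset.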
revision_history: contains the ordered set of parents of each revision. Each revision has an ordered set of parents (0 for the initial commit of a repository, 1 for a regular commit, 2 for a regular merge commit and 3 or more for octopus-style merge commits).
id (string): the Git SHA-1 identifier of the revision (hexadecimal)
parent_id (string): the Git SHA-1 identifier of the parent (hexadecimal)
parent_rank (integer): the rank of the parent, which defines the ordering between the parents of the revision
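Since parent_rank orders the parents of a revision, sorting on it recovers, for instance, the first-parent history of a branch. This is a minimal sketch on an in-memory SQLite table; the revision identifiers (`r1` to `r4`), the helper names, and the choice of 0 as the lowest rank in the toy data are all assumptions for illustration. The code depends only on the relative ordering of ranks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revision_history "
    "(id TEXT, parent_id TEXT, parent_rank INTEGER)"
)
conn.executemany(
    "INSERT INTO revision_history VALUES (?, ?, ?)",
    [
        ("r4", "r3", 0),   # merge commit, first parent
        ("r4", "r2b", 1),  # merge commit, second parent
        ("r3", "r2", 0),
        ("r2b", "r1", 0),
        ("r2", "r1", 0),
        # r1 has no rows: it is an initial commit with 0 parents.
    ],
)


def parents(conn, rev_id):
    """Return the parents of a revision, ordered by parent_rank."""
    rows = conn.execute(
        "SELECT parent_id FROM revision_history "
        "WHERE id = ? ORDER BY parent_rank",
        (rev_id,),
    ).fetchall()
    return [row[0] for row in rows]


def first_parent_chain(conn, rev_id):
    """Follow the lowest-ranked parent until an initial commit is reached."""
    chain = [rev_id]
    while True:
        ordered = parents(conn, chain[-1])
        if not ordered:
            return chain
        chain.append(ordered[0])


print(first_parent_chain(conn, "r4"))
```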
release: contains the releases stored in the archive.
id (string): the intrinsic hash of the release (hexadecimal), recursively computed with the Git SHA-1 algorithm
target (string): the Git SHA-1 of the object the release points to (hexadecimal)
date (timestamp): the date the release was created
author (integer): the author of the release
name (bytes): the release name
message (bytes): the release message
snapshot: contains the list of snapshots stored in the archive.
id (string): the intrinsic hash of the snapshot (hexadecimal), recursively computed with the Git SHA-1 algorithm.
snapshot_branch: contains the list of branches associated with each snapshot.
snapshot_id (string): the intrinsic hash of the snapshot (hexadecimal)
name (bytes): the name of the branch
target (string): the intrinsic hash of the object the branch points to (hexadecimal)
target_type (string): the type of object the branch points to (either release, revision, directory or content).
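The target_type column tells which table the target hash should be looked up in, so queries usually filter on it before joining. A minimal sketch with toy rows follows; the identifiers (`s1`, `r4`, `rel1`), the Git-style branch names, and the function name `branches_by_type` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snapshot_branch "
    "(snapshot_id TEXT, name BLOB, target TEXT, target_type TEXT)"
)
conn.executemany(
    "INSERT INTO snapshot_branch VALUES (?, ?, ?, ?)",
    [
        ("s1", b"refs/heads/main", "r4", "revision"),
        ("s1", b"refs/heads/dev", "r2b", "revision"),
        ("s1", b"refs/tags/v1.0", "rel1", "release"),
    ],
)


def branches_by_type(conn, snapshot_id, target_type):
    """Map branch name to target hash for one target type of a snapshot."""
    rows = conn.execute(
        "SELECT name, target FROM snapshot_branch "
        "WHERE snapshot_id = ? AND target_type = ?",
        (snapshot_id, target_type),
    ).fetchall()
    return {name: target for name, target in rows}


print(branches_by_type(conn, "s1", "revision"))
```

The hashes returned for `revision` can then be matched against revision.id, and those returned for `release` against release.id.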
origin: the software origins from which the projects in the dataset were archived.
url (bytes): the URL of the origin
origin_visit: the different visits of each origin. Since Software Heritage archives software continuously, software origins are crawled more than once. Each of these “visits” is an entry in this table.
origin (string): the URL of the origin visited
visit (integer): an integer identifier of the visit
date (timestamp): the date at which the origin was visited
type (string): the type of origin visited (e.g. git, pypi, hg, svn, ftp, deb, …)
origin_visit_status: the status of each visit.
origin (string): the URL of the origin visited
visit (integer): an integer identifier of the visit
date (timestamp): the date at which the origin was visited
type (string): the type of origin visited (e.g. git, pypi, hg, svn, ftp, deb, …)
snapshot_id (string): the intrinsic hash of the snapshot archived in this visit (hexadecimal).
status (string): the status of the visit, either partial for partial visits or full for full visits.
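A common entry point into the graph is the snapshot of the most recent full visit of an origin. The sketch below runs that query on an in-memory SQLite table; the URL, snapshot identifiers, and function name `latest_full_snapshot` are hypothetical, and dates are stored as ISO 8601 strings so that text ordering matches chronological ordering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE origin_visit_status "
    "(origin TEXT, visit INTEGER, date TEXT, type TEXT, "
    "snapshot_id TEXT, status TEXT)"
)
url = "https://example.org/repo.git"  # placeholder origin URL
conn.executemany(
    "INSERT INTO origin_visit_status VALUES (?, ?, ?, ?, ?, ?)",
    [
        (url, 1, "2020-01-01T00:00:00", "git", "s1", "full"),
        (url, 2, "2021-01-01T00:00:00", "git", "s2", "full"),
        (url, 3, "2022-01-01T00:00:00", "git", "s3", "partial"),
    ],
)


def latest_full_snapshot(conn, origin_url):
    """Return the snapshot of the latest full visit, or None if there is none."""
    row = conn.execute(
        "SELECT snapshot_id FROM origin_visit_status "
        "WHERE origin = ? AND status = 'full' AND snapshot_id IS NOT NULL "
        "ORDER BY date DESC LIMIT 1",
        (origin_url,),
    ).fetchone()
    return row[0] if row else None


print(latest_full_snapshot(conn, url))
```

The returned snapshot_id can be used as the starting key into snapshot_branch.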