swh.model.git_objects module
Converts SWH model objects to git(-like) objects
Most of the functions in this module take as argument an object from
swh.model.model, and format it like a git object.
They are the inverse functions of those in swh.loader.git.converters,
but with extensions, as SWH’s model is a superset of Git’s:
- extensions of existing types (e.g. revision/commit and release/tag dates can be expressed with precision up to microseconds, to support formatting Mercurial objects) 
- new types, for SWH’s specific needs (swh.model.model.RawExtrinsicMetadata and swh.model.model.ExtID)
- support for somewhat corrupted git objects that we need to reproduce 
This is used for two purposes:
- Format manifests that can be hashed to produce intrinsic identifiers 
- Write git objects to reproduce git repositories that were ingested in the archive. 
- swh.model.git_objects.content_git_object(content: Content) -> bytes
- Formats a content as a git blob. - A content’s identifier is the blob sha1 à la git of the tagged content. 
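To illustrate the blob format, the sketch below computes a git-style blob identifier using only the standard library. It assumes the standard git object header (the word `blob`, a space, the decimal payload length, a null byte), which the description above does not spell out. `git_blob_sha1` is a hypothetical helper, not part of swh.model.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Hash raw content the way git hashes a blob (hypothetical helper)."""
    # Assumed git object header: type, space, decimal length, null byte.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


# The empty blob has a well-known identifier in every git repository.
print(git_blob_sha1(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```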
- swh.model.git_objects.directory_entry_sort_key(entry: DirectoryEntry)
- The sorting key for tree entries 
- swh.model.git_objects.escape_newlines(snippet)
- Escape the newlines present in snippet according to git rules. - New lines in git manifests are escaped by indenting the next line by one space. 
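The escaping rule can be sketched in one line. This is a minimal illustration that assumes the snippet is a byte string; `escape_newlines_sketch` is a hypothetical name and may differ from the real function in edge cases.

```python
def escape_newlines_sketch(snippet: bytes) -> bytes:
    """Indent every continuation line by one space (hypothetical helper)."""
    return snippet.replace(b"\n", b"\n ")


print(escape_newlines_sketch(b"line one\nline two"))  # b'line one\n line two'
```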
- swh.model.git_objects.format_date(date: Timestamp) -> bytes
- Convert a date object into a UTC timestamp encoded as ascii bytes. - Git stores timestamps as an integer number of seconds since the UNIX epoch. - However, Software Heritage stores timestamps as an integer number of microseconds (postgres type “datetime with timezone”). - Therefore, we print timestamps with no microseconds as integers, and timestamps with microseconds as floating point values. We elide the trailing zeroes from microsecond values, to “future-proof” our representation if we ever need more precision in timestamps. 
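The rule above (integers when there are no microseconds, otherwise a decimal with trailing zeroes removed) can be sketched as follows. `format_timestamp` is a hypothetical helper that takes plain integers rather than a Timestamp object, and it assumes the microseconds value lies between 0 and 999999.

```python
def format_timestamp(seconds: int, microseconds: int = 0) -> bytes:
    """Render a timestamp as ASCII bytes (hypothetical helper).

    Assumes 0 <= microseconds < 1_000_000.
    """
    if microseconds == 0:
        # Whole seconds are printed as a plain integer, as git does.
        return str(seconds).encode("ascii")
    # Zero-pad to six digits, then drop the trailing zeroes.
    text = "%d.%06d" % (seconds, microseconds)
    return text.rstrip("0").encode("ascii")


print(format_timestamp(1234567890))          # b'1234567890'
print(format_timestamp(1234567890, 500000))  # b'1234567890.5'
print(format_timestamp(1234567890, 42))      # b'1234567890.000042'
```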
- swh.model.git_objects.normalize_timestamp(time_representation)
- Normalize a time representation for processing by Software Heritage. - This function supports a numeric timestamp (representing a number of seconds since the UNIX epoch, 1970-01-01 at 00:00 UTC), a datetime.datetime object (with timezone information), or a normalized Software Heritage time representation (idempotency). - Parameters:
- time_representation – the representation of a timestamp 
- Returns:
- a normalized dictionary with three keys: - timestamp: a dict with two optional keys: - seconds: the integral number of seconds since the UNIX epoch 
- microseconds: the integral number of microseconds 
 
- offset: the timezone offset as a number of minutes relative to UTC 
- negative_utc: a boolean representing whether the offset is -0000 when offset = 0. 
 
- Return type: dict
 
- swh.model.git_objects.directory_git_object(directory: Dict | Directory) -> bytes
- Formats a directory as a git tree. - A directory’s identifier is the tree sha1 à la git of a directory listing, using the following algorithm, which is equivalent to the git algorithm for trees: - Entries of the directory are sorted using the name (or the name with ‘/’ appended for directory entries) as key, in bytes order. 
- For each entry of the directory, the following bytes are output: 
 - the octal representation of the permissions for the entry (stored in the ‘perms’ member), which is a representation of the entry type: - b’100644’ (int 33188) for files 
- b’100755’ (int 33261) for executable files 
- b’120000’ (int 40960) for symbolic links 
- b’40000’ (int 16384) for directories 
- b’160000’ (int 57344) for references to revisions 
 
- an ascii space (b’ ’) 
- the entry’s name (as raw bytes), stored in the ‘name’ member 
- a null byte (b’\0’) 
- the 20 byte long identifier of the object pointed at by the entry, stored in the ‘target’ member: - for files or executable files: their blob sha1_git 
- for symbolic links: the blob sha1_git of a file containing the link destination 
- for directories: their intrinsic identifier 
- for revisions: their intrinsic identifier 
 
 - (Note that there is no separator between entries) 
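The algorithm above can be sketched with the standard library alone. `Entry`, `sort_key`, `tree_manifest` and `tree_sha1` are hypothetical names used for illustration, and the `tree <length>\0` header is the standard git object header, assumed here because the description above covers only the manifest.

```python
import hashlib
from typing import Iterable, NamedTuple


class Entry(NamedTuple):
    # Hypothetical stand-in for a directory entry.
    name: bytes
    perms: int     # e.g. 0o100644 for a file, 0o40000 for a directory
    target: bytes  # 20-byte identifier of the object pointed at


def sort_key(entry: Entry) -> bytes:
    # Directories sort as if their name ended with a slash.
    return entry.name + b"/" if entry.perms == 0o40000 else entry.name


def tree_manifest(entries: Iterable[Entry]) -> bytes:
    parts = []
    for entry in sorted(entries, key=sort_key):
        # Octal permissions, space, name, null byte, raw 20-byte target.
        parts.append(b"%o %s\x00%s" % (entry.perms, entry.name, entry.target))
    return b"".join(parts)  # no separator between entries


def tree_sha1(entries: Iterable[Entry]) -> str:
    manifest = tree_manifest(entries)
    # Assumed standard git header: b"tree <length>\0".
    return hashlib.sha1(b"tree %d\x00" % len(manifest) + manifest).hexdigest()


# The empty tree has a well-known identifier in every git repository.
print(tree_sha1([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Note how the sort key places a file named `foo.txt` before a directory named `foo`, because `.` (0x2e) sorts before `/` (0x2f).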
- swh.model.git_objects.format_git_object_from_headers(git_type: str, headers: Iterable[Tuple[bytes, bytes]], message: bytes | None = None) -> bytes
- Format a git_object comprised of a git header and a manifest, which is itself a sequence of headers, and an optional message. - The git_object format, compatible with the git format for tag and commit objects, is as follows: - for each key, value in headers, emit: - the key, literally 
- an ascii space (\x20)
- the value, with newlines escaped using escape_newlines(),
- an ascii newline (\x0a)
 
- if the message is not None, emit: - an ascii newline (\x0a)
- the message, literally 
 
 - Parameters:
- headers – a sequence of key/value headers stored in the manifest; 
- message – an optional message used to trail the manifest. 
 
- Returns:
- the formatted git_object as bytes 
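The layout described above can be sketched as a small standalone function. `format_object` is a hypothetical re-implementation for illustration only; the leading `<type> <length>\0` header is the standard git object header, which is an assumption here since the text above details only the manifest.

```python
from typing import Iterable, Optional, Tuple


def format_object(
    git_type: str,
    headers: Iterable[Tuple[bytes, bytes]],
    message: Optional[bytes] = None,
) -> bytes:
    """Hypothetical re-implementation of the layout described above."""
    lines = []
    for key, value in headers:
        # Key, space, value with continuation lines indented, newline.
        lines.append(key + b" " + value.replace(b"\n", b"\n ") + b"\n")
    if message is not None:
        # An empty line separates the headers from the message.
        lines.append(b"\n" + message)
    manifest = b"".join(lines)
    # Assumed standard git header: b"<type> <length>\0".
    header = git_type.encode("ascii") + b" %d\x00" % len(manifest)
    return header + manifest


obj = format_object("commit", [(b"tree", b"0" * 40)], b"initial commit\n")
print(obj)
```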
 
- swh.model.git_objects.format_git_object_from_parts(git_type: str, parts: Iterable[bytes]) -> bytes
- Similar to format_git_object_from_headers(), but for manifests made of a flat list of entries, instead of key-value + message, i.e. trees and snapshots.
- swh.model.git_objects.format_author_data(author: Person, date_offset: TimestampWithTimezone | None) -> bytes
- Format authorship data according to git standards. - Git authorship data has two components: - an author specification, usually a name and email, but in practice an arbitrary bytestring 
- optionally, a timestamp with a UTC offset specification 
 - The authorship data is formatted thus: - `name and email`[ `timestamp` `utc_offset`] - The timestamp is encoded as a (decimal) number of seconds since the UNIX epoch (1970-01-01 at 00:00 UTC). As an extension to the git format, we support fractional timestamps, using a dot as the separator for the decimal part. - The utc offset is a number of minutes encoded as ‘[+-]HHMM’. Note that some tools can pass a negative offset corresponding to the UTC timezone (‘-0000’), which is valid and is encoded as such. - Returns:
- the byte string containing the authorship data 
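A minimal sketch of the two pieces described above: encoding an offset in minutes as `[+-]HHMM`, and joining the author bytestring with an optional date. Both function names are hypothetical, and the `negative_utc` flag is an assumed way of carrying the ‘-0000’ case.

```python
from typing import Optional


def format_offset_sketch(minutes: int, negative_utc: bool = False) -> bytes:
    """Encode a UTC offset in minutes as [+-]HHMM (hypothetical helper)."""
    # '-0000' is kept distinct from '+0000' when the source data used it.
    sign = "-" if minutes < 0 or (minutes == 0 and negative_utc) else "+"
    hours, mins = divmod(abs(minutes), 60)
    return ("%s%02d%02d" % (sign, hours, mins)).encode("ascii")


def format_author_line(
    fullname: bytes,
    timestamp: Optional[bytes] = None,
    offset: Optional[bytes] = None,
) -> bytes:
    """Join the author bytestring with an optional date (hypothetical helper)."""
    if timestamp is None or offset is None:
        return fullname
    return b" ".join([fullname, timestamp, offset])


print(format_offset_sketch(120))      # b'+0200'
print(format_offset_sketch(-330))     # b'-0530'
print(format_offset_sketch(0, True))  # b'-0000'
```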
 
- swh.model.git_objects.revision_git_object(revision: Dict | Revision) -> bytes
- Formats a revision as a git commit. - The fields used for the revision identifier computation are: - directory 
- parents 
- author 
- author_date 
- committer 
- committer_date 
- extra_headers or metadata -> extra_headers 
- message 
 - A revision’s identifier is the ‘git’-checksum of a commit manifest constructed as follows (newlines are a single ASCII newline character):
     tree <directory identifier>
     [for each parent in parents]
     parent <parent identifier>
     [end for each parents]
     author <author> <author_date>
     committer <committer> <committer_date>
     [for each key, value in extra_headers]
     <key> <encoded value>
     [end for each extra_headers]

     <message>
 - The directory identifier is the ascii representation of its hexadecimal encoding.
 - Author and committer are formatted using the Person.fullname attribute only. Dates are formatted with the format_offset() function.
 - Extra headers are an ordered list of [key, value] pairs. Keys are strings and get encoded to utf-8 for identifier computation. Values are either byte strings, unicode strings (that get encoded to utf-8), or integers (that get encoded to their utf-8 decimal representation).
 - Multiline extra header values are escaped by indenting the continuation lines with one ascii space.
 - If the message is None, the manifest ends with the last header. Else, the message is appended to the headers after an empty line.
 - The checksum of the full manifest is computed using the ‘commit’ git object type. 
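The header ordering and the encoding of extra header values can be sketched as below. `commit_headers` is a hypothetical helper: it takes already-formatted author and committer byte strings, and returns only the ordered header list, leaving newline escaping and hashing to a later step.

```python
from typing import List, Sequence, Tuple, Union

HeaderValue = Union[bytes, str, int]


def commit_headers(
    directory_hex: str,
    parents_hex: Sequence[str],
    author: bytes,     # already formatted: fullname, timestamp, offset
    committer: bytes,  # same format as author
    extra_headers: Sequence[Tuple[str, HeaderValue]] = (),
) -> List[Tuple[bytes, bytes]]:
    """Build the ordered header list of a commit manifest (hypothetical helper)."""
    headers = [(b"tree", directory_hex.encode("ascii"))]
    headers += [(b"parent", p.encode("ascii")) for p in parents_hex]
    headers += [(b"author", author), (b"committer", committer)]
    for key, value in extra_headers:
        if isinstance(value, int):
            value = str(value)  # integers use their decimal representation
        if isinstance(value, str):
            value = value.encode("utf-8")
        headers.append((key.encode("utf-8"), value))
    return headers


who = b"Jane Doe <jane@example.org> 1234567890 +0000"
for key, value in commit_headers(
    "a" * 40, ["b" * 40], who, who, [("encoding", "latin-1"), ("x-count", 3)]
):
    print(key, value)
```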
- swh.model.git_objects.target_type_to_git(target_type: ReleaseTargetType) -> bytes
- Convert a Software Heritage target type to a git object type 
- swh.model.git_objects.snapshot_git_object(snapshot: Dict | Snapshot, *, ignore_unresolved: bool = False) -> bytes
- Formats a snapshot as a git-like object.
- Snapshots are a set of named branches, which are pointers to objects at any level of the Software Heritage DAG.
- As well as pointing to other objects in the Software Heritage DAG, branches can also be aliases, in which case their target is the name of another branch in the same snapshot, or dangling, in which case the target is unknown (and represented by the None value).
- A snapshot identifier is a salted sha1 (using the git hashing algorithm with the snapshot object type) of a manifest following the algorithm:
- Branches are sorted using the name as key, in bytes order. 
- For each branch, the following bytes are output: 
 - the type of the branch target: - content, directory, revision, release or snapshot for the corresponding entries in the DAG;
- alias for branches referencing another branch;
- dangling for dangling branches
 
- an ascii space (\x20)
- the branch name (as raw bytes) 
- a null byte (\x00)
- the length of the target identifier, as an ascii-encoded decimal number (20 for current intrinsic identifiers, 0 for dangling branches, the length of the target branch name for branch aliases)
- a colon (:)
- the identifier of the target object pointed at by the branch, stored in the ‘target’ member: - for contents: their sha1_git 
- for directories, revisions, releases or snapshots: their intrinsic identifier 
- for branch aliases, the name of the target branch (as raw bytes) 
- for dangling branches, the empty string 
 
 - Note that, akin to directory manifests, there is no separator between entries. Because of symbolic branches, identifiers are of arbitrary length but are length-encoded to avoid ambiguity. - Parameters:
- ignore_unresolved – if False (the default), raises an exception when alias branches point to non-existing branches 
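The per-branch layout above can be sketched as follows. The `Branches` mapping is a hypothetical representation chosen for this example (`None` stands for a dangling branch), and the sketch does not check whether alias targets exist, so it does not reproduce the ignore_unresolved behaviour.

```python
from typing import Dict, Optional, Tuple

# Hypothetical branch representation: name -> (target_type, target),
# or None for a dangling branch.
Branches = Dict[bytes, Optional[Tuple[str, bytes]]]


def snapshot_manifest(branches: Branches) -> bytes:
    parts = []
    for name in sorted(branches):  # bytes order
        branch = branches[name]
        if branch is None:
            target_type, target = "dangling", b""
        else:
            target_type, target = branch
        # type, space, name, null byte, target length, colon, target
        parts.append(
            target_type.encode("ascii")
            + b" "
            + name
            + b"\x00"
            + str(len(target)).encode("ascii")
            + b":"
            + target
        )
    return b"".join(parts)  # no separator between entries


example = {
    b"HEAD": ("alias", b"refs/heads/main"),
    b"refs/heads/main": ("revision", bytes(20)),
    b"refs/heads/gone": None,
}
print(snapshot_manifest(example))
```

The length prefix is what keeps the manifest unambiguous: an alias target such as `refs/heads/main` is 15 bytes long, a dangling target is 0 bytes, and an intrinsic identifier is 20 bytes.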
 
- swh.model.git_objects.raw_extrinsic_metadata_git_object(metadata: Dict | RawExtrinsicMetadata) -> bytes
- Formats RawExtrinsicMetadata as a git-like object.
- A raw_extrinsic_metadata identifier is a salted sha1 (using the git hashing algorithm with the raw_extrinsic_metadata object type) of a manifest following the format:
    target $ExtendedSwhid
    discovery_date $Timestamp
    authority $StrWithoutSpaces $IRI
    fetcher $Str $Version
    format $StrWithoutSpaces
    origin $IRI                  <- optional
    visit $IntInDecimal          <- optional
    snapshot $CoreSwhid          <- optional
    release $CoreSwhid           <- optional
    revision $CoreSwhid          <- optional
    path $Bytes                  <- optional
    directory $CoreSwhid         <- optional

    $MetadataBytes
- $IRI must be RFC 3987 IRIs (so they may contain newlines, that are escaped as described below)
- $StrWithoutSpaces and $Version are ASCII strings, and may not contain spaces.
- $Str is a UTF-8 string.
- $CoreSwhid are core SWHIDs, as defined in SoftWare Heritage persistent IDentifiers (SWHIDs). $ExtendedSwhid is a core SWHID, with extra types allowed (‘ori’ for origins and ‘emd’ for raw extrinsic metadata)
- $Timestamp is a decimal representation of the rounded-down integer number of seconds since the UNIX epoch (1970-01-01 00:00:00 UTC), with no leading ‘0’ (unless the timestamp value is zero) and no timezone. It may be negative by prefixing it with a ‘-’, which must not be followed by a ‘0’.
- Newlines in $Bytes, $Str, and $IRI are escaped as with other git fields, i.e. by adding a space after them. 
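The $Timestamp rule (rounded down, no leading zero, never ‘-0’) can be illustrated with a short sketch. `format_discovery_date` is a hypothetical helper and assumes a timezone-aware datetime as input.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def format_discovery_date(date: datetime) -> bytes:
    """Round an aware datetime down to whole seconds (hypothetical helper)."""
    # Floor division of timedeltas rounds toward negative infinity,
    # so half a second before the epoch yields -1, never "-0".
    seconds = (date - EPOCH) // timedelta(seconds=1)
    return str(seconds).encode("ascii")


print(format_discovery_date(EPOCH + timedelta(seconds=1.9)))  # b'1'
print(format_discovery_date(EPOCH - timedelta(seconds=0.5)))  # b'-1'
```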
- swh.model.git_objects.extid_git_object(extid: ExtID) -> bytes
- Formats an extid as a git-like object.
- An ExtID identifier is a salted sha1 (using the git hashing algorithm with the extid object type) of a manifest following the format:
    extid_type $StrWithoutSpaces
    [extid_version $Str]
    extid $Bytes
    target $CoreSwhid
    [payload_type $StrWithoutSpaces]
    [payload $ContentIdentifier]
- $StrWithoutSpaces is an ASCII string, and may not contain spaces.
- Newlines in $Bytes are escaped as with other git fields, i.e. by adding a space after them.
- The extid_version line is only generated if the version is non-zero.
- The payload_type and payload lines are only generated if they are not None. $ContentIdentifier is the object ID of a content object.
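The conditional lines of the manifest can be sketched as below. `extid_headers` is a hypothetical helper operating on plain values rather than an ExtID object. It assumes that the payload identifier is written in hexadecimal and that payload_type and payload are always set together; neither point is stated explicitly above.

```python
from typing import List, Optional, Tuple


def extid_headers(
    extid_type: str,
    extid: bytes,
    target_swhid: str,
    extid_version: int = 0,
    payload_type: Optional[str] = None,
    payload_hex: Optional[str] = None,  # assumed hexadecimal encoding
) -> List[Tuple[bytes, bytes]]:
    """Ordered header lines of an ExtID manifest (hypothetical helper)."""
    headers = [(b"extid_type", extid_type.encode("ascii"))]
    if extid_version != 0:
        # The version line is omitted for version 0.
        headers.append((b"extid_version", str(extid_version).encode("ascii")))
    headers.append((b"extid", extid))
    headers.append((b"target", target_swhid.encode("ascii")))
    if payload_type is not None and payload_hex is not None:
        headers.append((b"payload_type", payload_type.encode("ascii")))
        headers.append((b"payload", payload_hex.encode("ascii")))
    return headers


target = "swh:1:rev:" + "0" * 40
for key, value in extid_headers("hg-nodeid", b"\x12\x34", target, extid_version=1):
    print(key, value)
```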