swh.model.model module
Implementation of Software Heritage’s data model
See Data model for an overview of the data model.
The classes defined in this module are immutable attrs objects and enums.
All classes define a from_dict class method and a to_dict
method to convert between them and msgpack-serializable objects.
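The round trip between an immutable object and a plain dictionary can be illustrated with a small stand-in. This is only a sketch using the standard library's frozen dataclasses rather than attrs; `PersonSketch` is a hypothetical class, not the `Person` class defined in this module.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class PersonSketch:
    """Hypothetical stand-in for an immutable model object."""

    fullname: bytes
    name: Optional[bytes]
    email: Optional[bytes]

    def to_dict(self) -> dict:
        # Convert the object into a plain dictionary of serializable values.
        return asdict(self)

    @classmethod
    def from_dict(cls, d: dict) -> "PersonSketch":
        # Rebuild the object from a dictionary shaped like to_dict()'s output.
        return cls(**d)


person = PersonSketch(
    fullname=b"Jane Doe <jane@example.org>",
    name=b"Jane Doe",
    email=b"jane@example.org",
)
round_tripped = PersonSketch.from_dict(person.to_dict())
```

Because the dataclass is frozen, `round_tripped` is a new, equal object and neither instance can be mutated in place.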
- exception swh.model.model.MissingData[source]#
- Bases: Exception. Raised by Content.with_data when it has no way of fetching the data (but not when fetching the data fails).
- swh.model.model.KeyType#
- The type returned by BaseModel.unique_key(). 
- swh.model.model.freeze_optional_dict(d: None | Dict | ImmutableDict) ImmutableDict | None[source]#
- swh.model.model.generic_type_validator(instance, attribute, value)[source]#
- Validates the type of an attribute value, whatever the attribute type.
- swh.model.model.optimize_all_validators(cls, old_fields)[source]#
- process validators to turn them into a faster version … eventually 
- class swh.model.model.ModelObjectType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
- Bases: _StringCompatibleEnum. Possible object types of Model object.
 - CONTENT = 'content'#
 - DIRECTORY = 'directory'#
 - DIRECTORY_ENTRY = 'directory_entry'#
 - EXTID = 'extid'#
 - METADATA_AUTHORITY = 'metadata_authority'#
 - METADATA_FETCHER = 'metadata_fetcher'#
 - ORIGIN = 'origin'#
 - ORIGIN_VISIT = 'origin_visit'#
 - ORIGIN_VISIT_STATUS = 'origin_visit_status'#
 - PERSON = 'person'#
 - RAW_EXTRINSIC_METADATA = 'raw_extrinsic_metadata'#
 - RELEASE = 'release'#
 - REVISION = 'revision'#
 - SKIPPED_CONTENT = 'skipped_content'#
 - SNAPSHOT = 'snapshot'#
 - SNAPSHOT_BRANCH = 'snapshot_branch'#
 - TIMESTAMP = 'timestamp'#
 - TIMESTAMP_WITH_TIMEZONE = 'timestamp_with_timezone'#
 
- class swh.model.model.BaseModel[source]#
- Bases: ABC. Base class for SWH model classes. Provides serialization/deserialization to/from Python dictionaries that are suitable for JSON/msgpack-like formats.
 - abstract property object_type: ModelObjectType#
 - to_dict()[source]#
- Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields. 
 - classmethod from_dict(d)[source]#
- Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects. 
 - evolve(**kwargs) ModelType[source]#
- Alias to call attr.evolve() on this object, returning a new object.
 - anonymize() ModelType | None[source]#
- Returns an anonymized version of the object, if needed. - If the object model does not need/support anonymization, returns None. 
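Since model objects are immutable, changing a field means building a new object. The sketch below imitates that behaviour with `dataclasses.replace`, the standard library's closest analogue to `attr.evolve()`; `OriginVisitSketch` is a hypothetical class used only for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OriginVisitSketch:
    """Hypothetical immutable object with an evolve() helper."""

    origin: str
    type: str

    def evolve(self, **kwargs) -> "OriginVisitSketch":
        # Return a copy with the given fields changed; self is left untouched.
        return replace(self, **kwargs)


visit = OriginVisitSketch(origin="https://example.org/repo.git", type="git")
updated = visit.evolve(type="hg")
```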
 
- class swh.model.model.BaseHashableModel[source]#
- Mixin to automatically compute the object identifier hash when the associated model is instantiated.
 - compute_hash() bytes[source]#
- Derived model classes must implement this to compute the object hash. - This method is called by the object initialization if the id attribute is set to an empty value. 
 - evolve(**kwargs) HashableModelType[source]#
- Alias to call attr.evolve() on this object, returning a new object with its id recomputed based on the content.
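The pattern described here (compute the identifier at construction time when `id` is empty, and recompute it on `evolve`) can be sketched as follows. This is an illustration with frozen dataclasses, not the module's code; `HashableSketch` and its use of a plain SHA-1 over `payload` are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HashableSketch:
    """Hypothetical object whose id is derived from its content."""

    payload: bytes
    id: bytes = b""

    def __post_init__(self) -> None:
        # An empty id means "not computed yet": fill it in at construction.
        if not self.id:
            object.__setattr__(self, "id", self.compute_hash())

    def compute_hash(self) -> bytes:
        # Assumption for the sketch: the id is the SHA-1 of the payload.
        return hashlib.sha1(self.payload).digest()

    def evolve(self, **kwargs) -> "HashableSketch":
        # Unless the caller passes an explicit id, reset it so that
        # __post_init__ recomputes it from the new content.
        kwargs.setdefault("id", b"")
        return replace(self, **kwargs)


first = HashableSketch(payload=b"hello")
second = first.evolve(payload=b"world")
```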
 
- swh.model.model.HashableObject#
- alias of BaseHashableModel
- class swh.model.model.HashableObjectWithManifest[source]#
- Bases: BaseHashableModel. Derived class of BaseHashableModel, for objects that may need to store verbatim git objects as raw_manifest to preserve original hashes.
 - raw_manifest: bytes | None = None#
- Stores the original content of git objects when they cannot be faithfully represented using only the other attributes. - This should only be used as a last resort, and only set in the Git loader, for objects too corrupt to fit the data model. 
 - to_dict()[source]#
- Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields. 
 
- class swh.model.model.Person(fullname: bytes, name: bytes | None, email: bytes | None)[source]#
- Bases: BaseModel. Represents the author/committer of a revision or release. Method generated by attrs for class Person.
 - fullname#
 - name#
 - email#
 - classmethod from_fullname(fullname: bytes)[source]#
- Returns a Person object, by guessing the name and email from the fullname, in the name <email> format. - The fullname is left unchanged. 
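One plausible way to guess a name and an email from a `name <email>` byte string is sketched below. `split_fullname` is a hypothetical helper and its handling of edge cases is an assumption; it is not necessarily what `from_fullname` does internally.

```python
from typing import Optional, Tuple


def split_fullname(fullname: bytes) -> Tuple[Optional[bytes], Optional[bytes]]:
    """Hypothetical helper: guess (name, email) from b"name <email>"."""
    open_idx = fullname.find(b"<")
    if open_idx == -1:
        # No "<": treat the whole string as the name, with no email.
        return (fullname.strip() or None, None)
    close_idx = fullname.find(b">", open_idx)
    if close_idx == -1:
        # Unterminated email: take everything after "<".
        email = fullname[open_idx + 1:]
    else:
        email = fullname[open_idx + 1:close_idx]
    name = fullname[:open_idx].strip()
    # Empty parts are reported as None rather than b"".
    return (name or None, email or None)


name, email = split_fullname(b"Jane Doe <jane@example.org>")
```

The original `fullname` bytes are never modified, which matches the note that the fullname is left unchanged.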
 
- exception swh.model.model.TimestampOverflowException[source]#
- Bases: ValueError. Raised when trying to build a Timestamp from a timestamp too far in the past or future.
- class swh.model.model.Timestamp(seconds: int, microseconds: int)[source]#
- Bases: BaseModel. Represents a naive timestamp from a VCS. Method generated by attrs for class Timestamp.
 - seconds#
 - microseconds#
 - MIN_SECONDS = -62135510961#
 - MAX_SECONDS = 253402297199#
 - MIN_MICROSECONDS = 0#
 - MAX_MICROSECONDS = 999999#
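The four bounds above suggest a simple range check at construction time. The sketch below copies the constants from this page; `check_timestamp` and `TimestampOverflowSketch` are hypothetical names, and raising a plain `ValueError` for out-of-range microseconds is an assumption.

```python
MIN_SECONDS = -62135510961
MAX_SECONDS = 253402297199
MIN_MICROSECONDS = 0
MAX_MICROSECONDS = 999999


class TimestampOverflowSketch(ValueError):
    """Hypothetical stand-in for TimestampOverflowException."""


def check_timestamp(seconds: int, microseconds: int) -> None:
    # Reject timestamps too far in the past or future.
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        raise TimestampOverflowSketch(f"seconds out of range: {seconds}")
    # Assumption: microseconds outside [0, 999999] are a plain ValueError.
    if not MIN_MICROSECONDS <= microseconds <= MAX_MICROSECONDS:
        raise ValueError(f"microseconds out of range: {microseconds}")


check_timestamp(1642765364, 0)  # within bounds, returns None
```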
 
- class swh.model.model.TimestampWithTimezone(timestamp: Timestamp, offset_bytes: bytes)[source]#
- Bases: BaseModel. Represents a TZ-aware timestamp from a VCS. Method generated by attrs for class TimestampWithTimezone.
 - timestamp#
 - offset_bytes#
- Raw git representation of the timezone, as an offset from UTC. It should follow this format: +HHMM or -HHMM (including +0000 and -0000). However, when created from git objects, it must be the exact bytes used in the original objects, so it may differ from this format when they do.
 - classmethod from_numeric_offset(timestamp: Timestamp, offset: int, negative_utc: bool) TimestampWithTimezone[source]#
- Returns a TimestampWithTimezone instance from the old dictionary format (with offset and negative_utc instead of offset_bytes).
 - classmethod from_dict(time_representation: Dict | datetime | int) TimestampWithTimezone[source]#
- Builds a TimestampWithTimezone from any of the formats accepted by swh.model.normalize_timestamp().
 - classmethod from_datetime(dt: datetime) TimestampWithTimezone[source]#
 - to_datetime() datetime[source]#
- Converts to a datetime (with a timezone set to the recorded fixed UTC offset). Beware that this conversion can be lossy: -0000 and ‘weird’ offsets cannot be represented. Also note that it may fail due to type overflow.
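The conversion to a fixed-offset datetime, and why it loses the distinction between `+0000` and `-0000`, can be sketched with the standard library. `to_datetime_sketch` is a hypothetical function taking an already-parsed offset in minutes; it is not the method's actual implementation.

```python
from datetime import datetime, timedelta, timezone


def to_datetime_sketch(seconds: int, microseconds: int, offset_minutes: int) -> datetime:
    """Hypothetical conversion to a datetime with a fixed UTC offset."""
    # A fixed-offset timezone only stores a signed duration, so b"+0000"
    # and b"-0000" both collapse to the same zero offset here.
    tz = timezone(timedelta(minutes=offset_minutes))
    # fromtimestamp() may raise for values outside the platform's range.
    return datetime.fromtimestamp(seconds, tz=tz).replace(microsecond=microseconds)


dt = to_datetime_sketch(1642765364, 0, 120)
```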
 - classmethod from_iso8601(s)[source]#
- Builds a TimestampWithTimezone from an ISO8601-formatted string. 
 - offset_minutes()[source]#
- Returns the offset, as a number of minutes since UTC.
  >>> TimestampWithTimezone(
  ...     Timestamp(seconds=1642765364, microseconds=0), offset_bytes=b"+0000"
  ... ).offset_minutes()
  0
  >>> TimestampWithTimezone(
  ...     Timestamp(seconds=1642765364, microseconds=0), offset_bytes=b"+0200"
  ... ).offset_minutes()
  120
  >>> TimestampWithTimezone(
  ...     Timestamp(seconds=1642765364, microseconds=0), offset_bytes=b"-0200"
  ... ).offset_minutes()
  -120
  >>> TimestampWithTimezone(
  ...     Timestamp(seconds=1642765364, microseconds=0), offset_bytes=b"+0530"
  ... ).offset_minutes()
  330
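For well-formed `+HHMM` / `-HHMM` values, the arithmetic behind these results is short enough to spell out. The function below is a standalone sketch (`offset_minutes_sketch` is a hypothetical name) and makes no attempt to handle the malformed offsets that real git objects may contain.

```python
def offset_minutes_sketch(offset_bytes: bytes) -> int:
    """Hypothetical parser for well-formed b"+HHMM" / b"-HHMM" offsets."""
    sign = -1 if offset_bytes[:1] == b"-" else 1
    hours = int(offset_bytes[1:3].decode("ascii"))
    minutes = int(offset_bytes[3:5].decode("ascii"))
    # Total offset in minutes, carrying the sign of the original string.
    return sign * (hours * 60 + minutes)


half_hour_zone = offset_minutes_sketch(b"+0530")
```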
 
- class swh.model.model.Origin(url: str, id: bytes = b'')[source]#
- Bases: BaseHashableModel. Represents a software source: a VCS and a URL. Method generated by attrs for class Origin.
 - url#
 - unique_key() Dict[str, str] | Dict[str, bytes] | bytes[source]#
- Returns a unique key for this object, that can be used for deduplication. 
 - swhid() ExtendedSWHID[source]#
- Returns a SWHID representing this origin. 
 
- class swh.model.model.OriginVisit(origin: str, date: datetime, type: str, visit: int | None = None)[source]#
- Bases: BaseModel. Represents an origin visit with a given type at a given point in time, by a SWH loader. Method generated by attrs for class OriginVisit.
 - origin#
 - date#
 - type#
- Should not be set before calling ‘origin_visit_add()’. 
 - visit#
 
- class swh.model.model.OriginVisitStatus(origin: str, visit: int, date: datetime, status: str, snapshot: bytes | None, type: str | None = None, metadata: None | Dict | ImmutableDict = None)[source]#
- Bases: BaseModel. Represents a visit update of an origin at a given point in time. Method generated by attrs for class OriginVisitStatus.
 - origin#
 - visit#
 - date#
 - status#
 - snapshot#
 - type#
 - metadata#
 - unique_key() Dict[str, str] | Dict[str, bytes] | bytes[source]#
- Returns a unique key for this object, that can be used for deduplication. 
 - origin_swhid() ExtendedSWHID[source]#
 
- class swh.model.model.SnapshotTargetType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
- Bases: Enum. The type of content pointed to by a snapshot branch. Usually a revision or an alias.
 - CONTENT = 'content'#
 - DIRECTORY = 'directory'#
 - REVISION = 'revision'#
 - RELEASE = 'release'#
 - SNAPSHOT = 'snapshot'#
 - ALIAS = 'alias'#
 
- swh.model.model.TargetType#
- alias of SnapshotTargetType
- class swh.model.model.ReleaseTargetType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
- Bases: Enum. The type of content pointed to by a release. Usually a revision.
 - CONTENT = 'content'#
 - DIRECTORY = 'directory'#
 - REVISION = 'revision'#
 - RELEASE = 'release'#
 - SNAPSHOT = 'snapshot'#
 
- swh.model.model.ObjectType#
- alias of ReleaseTargetType
- class swh.model.model.SnapshotBranch(target: bytes, target_type: SnapshotTargetType)[source]#
- Bases: BaseModel. Represents one of the branches of a snapshot. Method generated by attrs for class SnapshotBranch.
 - target#
 - target_type#
 - check_target(attribute, value)[source]#
- Checks the target type is not an alias, checks the target is a valid sha1_git. 
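One reading of this check is that alias branches are exempt (their target is a branch name), while every other target must look like a sha1_git. The sketch below assumes that reading and takes "valid sha1_git" to mean a 20-byte SHA-1 digest; `TargetTypeSketch` and `check_target_sketch` are hypothetical names.

```python
from enum import Enum


class TargetTypeSketch(Enum):
    """Hypothetical, reduced version of the snapshot target types."""

    REVISION = "revision"
    ALIAS = "alias"


def check_target_sketch(target: bytes, target_type: TargetTypeSketch) -> None:
    # Aliases point at another branch name, so no digest length applies.
    if target_type is TargetTypeSketch.ALIAS:
        return
    # A SHA-1 digest is always 20 bytes long.
    if len(target) != 20:
        raise ValueError(f"Wrong length for bytes identifier: {len(target)}")


check_target_sketch(b"\x00" * 20, TargetTypeSketch.REVISION)
check_target_sketch(b"refs/heads/main", TargetTypeSketch.ALIAS)
```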
 
- class swh.model.model.Snapshot(branches: None | Dict | ImmutableDict, id: bytes = b'')[source]#
- Bases: BaseHashableModel. Represents the full state of an origin at a given point in time. Method generated by attrs for class Snapshot.
 - branches#
 
- class swh.model.model.Release(name: bytes, message: bytes | None, target: bytes | None, target_type: ReleaseTargetType, synthetic: bool, author: Person | None = None, date: TimestampWithTimezone | None = None, metadata: None | Dict | ImmutableDict = None, id: bytes = b'', raw_manifest: bytes | None = None)[source]#
- Bases: HashableObjectWithManifest, BaseModel. Method generated by attrs for class Release.
 - name#
 - message#
 - target#
 - target_type#
 - synthetic#
 - author#
 - date#
 - metadata#
 - raw_manifest: bytes | None#
- Stores the original content of git objects when they cannot be faithfully represented using only the other attributes. - This should only be used as a last resort, and only set in the Git loader, for objects too corrupt to fit the data model. 
 - to_dict()[source]#
- Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields. 
 - classmethod from_dict(d)[source]#
- Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects. 
 
- class swh.model.model.RevisionType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
- Bases: Enum.
 - GIT = 'git'#
 - TAR = 'tar'#
 - DSC = 'dsc'#
 - SUBVERSION = 'svn'#
 - MERCURIAL = 'hg'#
 - CVS = 'cvs'#
 - BAZAAR = 'bzr'#
 
- class swh.model.model.Revision(message: bytes | None, author: Person | None, committer: Person | None, date: TimestampWithTimezone | None, committer_date: TimestampWithTimezone | None, type: RevisionType, directory: bytes, synthetic: bool, metadata: None | Dict | ImmutableDict = None, parents: Tuple[bytes, ...] = (), id: bytes = b'', extra_headers: Iterable = (), raw_manifest: bytes | None = None)[source]#
- Bases: HashableObjectWithManifest, BaseModel. Method generated by attrs for class Revision.
 - message#
 - author#
 - committer#
 - date#
 - committer_date#
 - type#
 - directory#
 - synthetic#
 - metadata#
 - parents#
 - extra_headers#
 - raw_manifest: bytes | None#
- Stores the original content of git objects when they cannot be faithfully represented using only the other attributes. - This should only be used as a last resort, and only set in the Git loader, for objects too corrupt to fit the data model. 
 - check_committer(attribute, value)[source]#
- If the committer is None, checks the committer_date is None too. 
 - classmethod from_dict(d)[source]#
- Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects. 
 
- class swh.model.model.DirectoryEntry(name: bytes, type: str, target: bytes, perms)[source]#
- Bases: BaseModel. Method generated by attrs for class DirectoryEntry.
 - name#
 - type#
 - target#
 - perms#
- Usually one of the values of swh.model.from_disk.DentryPerms. 
 - DIR_ENTRY_TYPE_TO_SWHID_OBJECT_TYPE = {'dir': ObjectType.DIRECTORY, 'file': ObjectType.CONTENT, 'rev': ObjectType.REVISION}#
 
- class swh.model.model.Directory(entries: Tuple[DirectoryEntry, ...], id: bytes = b'', raw_manifest: bytes | None = None)[source]#
- Bases: HashableObjectWithManifest, BaseModel. Method generated by attrs for class Directory.
 - entries#
 - raw_manifest: bytes | None#
- Stores the original content of git objects when they cannot be faithfully represented using only the other attributes. - This should only be used as a last resort, and only set in the Git loader, for objects too corrupt to fit the data model. 
 - classmethod from_dict(d)[source]#
- Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects. 
 - classmethod from_possibly_duplicated_entries(*, entries: Tuple[DirectoryEntry, ...], id: bytes = b'', raw_manifest: bytes | None = None) Tuple[bool, Directory][source]#
- Constructs a Directory object from a list of entries that may contain duplicated names. This is required to represent legacy objects that were ingested in the storage database before this check was added. As it is impossible for a Directory instance to have more than one entry with a given name, this function computes a raw_manifest and renames one of the entries before constructing the Directory. Returns:
- (is_corrupt, directory), where is_corrupt is True iff some entry names were indeed duplicated.
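The renaming step can be pictured with a toy function over entry names only. The suffix scheme below (`_1`, `_2`, ...) is purely hypothetical and is not claimed to match the library's renaming; the sketch also skips the `raw_manifest` computation and returns names instead of a `Directory`.

```python
from typing import List, Tuple


def dedupe_entry_names(names: List[bytes]) -> Tuple[bool, List[bytes]]:
    """Hypothetical helper: make entry names unique, reporting duplicates."""
    seen = set()
    unique_names = []
    is_corrupt = False
    for name in names:
        candidate = name
        counter = 0
        # Append a numeric suffix until the name no longer collides.
        while candidate in seen:
            is_corrupt = True
            counter += 1
            candidate = name + b"_" + str(counter).encode("ascii")
        seen.add(candidate)
        unique_names.append(candidate)
    return is_corrupt, unique_names


is_corrupt, renamed = dedupe_entry_names([b"a", b"b", b"a"])
```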
 
 
- class swh.model.model.BaseContent(status: str)[source]#
- Method generated by attrs for class BaseContent.
 - status#
 
- class swh.model.model.Content(sha1: bytes, sha1_git: bytes, sha256: bytes, blake2s256: bytes, length: int, status: str = 'visible', data: bytes | None = None, get_data: Callable[[], bytes] | None = None, ctime: datetime | None = None)[source]#
- Bases: BaseContent. Method generated by attrs for class Content.
 - sha1#
 - sha1_git#
 - sha256#
 - blake2s256#
 - length#
 - status#
 - data#
 - get_data#
 - ctime#
 - to_dict()[source]#
- Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields. 
 - classmethod from_data(data, status='visible', ctime=None) Content[source]#
- Generate a Content from a given data byte string. - This populates the Content with the hashes and length for the data passed as argument, as well as the data itself. 
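The hashes and length that populate a Content can be computed with `hashlib` alone. The sketch below assumes that `sha1_git` follows git's blob hashing (a SHA-1 over a `blob <length>\0` header followed by the data); `content_hashes` is a hypothetical helper, not a function of this module.

```python
import hashlib
from typing import Dict, Union


def content_hashes(data: bytes) -> Dict[str, Union[bytes, int]]:
    """Hypothetical helper computing the hash fields of a content object."""
    # Git hashes blobs with a "blob <length>\0" header prepended to the data.
    git_header = b"blob %d\x00" % len(data)
    return {
        "sha1": hashlib.sha1(data).digest(),
        "sha1_git": hashlib.sha1(git_header + data).digest(),
        "sha256": hashlib.sha256(data).digest(),
        "blake2s256": hashlib.blake2s(data, digest_size=32).digest(),
        "length": len(data),
    }


hashes = content_hashes(b"hello world\n")
```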
 - classmethod from_dict(d)[source]#
- Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects. 
 - with_data(raise_if_missing: bool = True) Content[source]#
- Loads the data attribute if get_data is not None. This call is almost a no-op, but subclasses may overload this method to lazy-load data (e.g. from disk or objstorage). Parameters:
- raise_if_missing – if True (default), raise a MissingData exception if no data is attached to the content object.
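The lazy-loading behaviour can be sketched with a frozen dataclass holding an optional `get_data` callable. `ContentSketch` and `MissingDataSketch` are hypothetical stand-ins, and the exact control flow is an assumption based on the description above.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


class MissingDataSketch(Exception):
    """Hypothetical stand-in for MissingData."""


@dataclass(frozen=True)
class ContentSketch:
    data: Optional[bytes] = None
    get_data: Optional[Callable[[], bytes]] = None

    def with_data(self, raise_if_missing: bool = True) -> "ContentSketch":
        # Data already attached: nothing to do.
        if self.data is not None:
            return self
        # Otherwise call the loader, if there is one, and attach its result.
        if self.get_data is not None:
            return replace(self, data=self.get_data(), get_data=None)
        if raise_if_missing:
            raise MissingDataSketch("data and get_data are both None")
        return self


lazy = ContentSketch(get_data=lambda: b"loaded on demand")
loaded = lazy.with_data()
```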
 
 
- class swh.model.model.SkippedContent(sha1: bytes | None, sha1_git: bytes | None, sha256: bytes | None, blake2s256: bytes | None, length: int | None, status: str, reason: str | None = None, origin: str | None = None, ctime: datetime | None = None)[source]#
- Bases: BaseContent. Method generated by attrs for class SkippedContent.
 - sha1#
 - sha1_git#
 - sha256#
 - blake2s256#
 - length#
 - status#
 - reason#
 - origin#
 - ctime#
 - to_dict()[source]#
- Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields. 
 - classmethod from_data(data: bytes, reason: str, ctime: datetime | None = None) SkippedContent[source]#
- Generate a SkippedContent from a given data byte string. - This populates the SkippedContent with the hashes and length for the data passed as argument. - You can use attr.evolve on such a generated content to nullify some of its attributes, e.g. for tests. 
 - classmethod from_dict(d)[source]#
- Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects. 
 
- class swh.model.model.MetadataAuthorityType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
- Bases: Enum.
 - DEPOSIT_CLIENT = 'deposit_client'#
 - FORGE = 'forge'#
 - REGISTRY = 'registry'#
 
- class swh.model.model.MetadataAuthority(type: MetadataAuthorityType, url: str, metadata: None | Dict | ImmutableDict = None)[source]#
- Bases: BaseModel. Represents an entity that provides metadata about an origin or software artifact. Method generated by attrs for class MetadataAuthority.
 - type#
 - url#
 - metadata#
 - to_dict()[source]#
- Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields. 
 
- class swh.model.model.MetadataFetcher(name: str, version: str, metadata: None | Dict | ImmutableDict = None)[source]#
- Bases: BaseModel. Represents a software component used to fetch metadata from a metadata authority and ingest it into the Software Heritage archive. Method generated by attrs for class MetadataFetcher.
 - name#
 - version#
 - metadata#
 
- class swh.model.model.RawExtrinsicMetadata(target: ExtendedSWHID, discovery_date: Any, authority: MetadataAuthority, fetcher: MetadataFetcher, format: str, metadata: bytes, origin: str | None = None, visit: int | None = None, snapshot: CoreSWHID | None = None, release: CoreSWHID | None = None, revision: CoreSWHID | None = None, path: bytes | None = None, directory: CoreSWHID | None = None, id: bytes = b'')[source]#
- Bases: BaseHashableModel. Method generated by attrs for class RawExtrinsicMetadata.
 - target#
 - discovery_date#
 - authority#
 - fetcher#
 - format#
 - metadata#
 - origin#
 - visit#
 - snapshot#
 - release#
 - revision#
 - path#
 - directory#
 - to_dict()[source]#
- Wrapper of attr.asdict that can be overridden by subclasses that have special handling of some of the fields. 
 - classmethod from_dict(d)[source]#
- Takes a dictionary representing a tree of SWH objects, and recursively builds the corresponding objects. 
 - swhid() ExtendedSWHID[source]#
- Returns a SWHID representing this RawExtrinsicMetadata object. 
 
- class swh.model.model.ExtID(extid_type: str, extid: bytes, target: CoreSWHID, extid_version: int = 0, payload_type: str | None = None, payload: bytes | None = None, id: bytes = b'')[source]#
- Bases: BaseHashableModel. Method generated by attrs for class ExtID.
 - extid_type#
 - extid#
 - target#
 - extid_version#
 - payload_type#
 - payload#