swh.loader.core.loader module#
- class swh.loader.core.loader.BaseLoader(storage: StorageInterface, origin_url: str, logging_class: str | None = None, save_data_path: str | None = None, max_content_size: int | None = None, lister_name: str | None = None, lister_instance_name: str | None = None, metadata_fetcher_credentials: Dict[str, Dict[str, List[Dict[str, str]]]] | None = None, create_partial_snapshot: bool = False)[source]#
- Bases: object
- Base class for (D)VCS loaders (e.g. Svn, Git, Mercurial, …) and package loaders (e.g. PyPI, Npm, CRAN, …).
- A loader retrieves origin information (git/mercurial/svn repositories, pypi/npm/… package artifacts), ingests the contents/directories/revisions/releases/snapshots read from those artifacts, and sends them to the archive through the storage backend.
- The main entry point for the loader is the load() function.
- Two class methods (from_config(), from_configfile()) centralize and ease loader instantiation from either a configuration dict or a configuration file.
- Some class examples:
- SvnLoader
- GitLoader
- PyPILoader
- NpmLoader
 - Parameters:
- lister_name – Name of the lister which triggered this load. If provided, the loader will try to use the forge’s API to retrieve extrinsic metadata.
- lister_instance_name – Name of the lister instance which triggered this load. Must be None if and only if lister_name is None, but may be the empty string for listers with a single instance.
 
 - parent_origins: List[Origin] | None#
- If the given origin is a “forge fork” (i.e. created with the “Fork” button of GitHub-like forges), build_extrinsic_origin_metadata() sets this to a list of origins it was forked from, closest parent first.
 - classmethod from_config(storage: Dict[str, Any], overrides: Dict[str, Any] | None = None, **extra_kwargs: Any)[source]#
- Instantiate a loader from a configuration dict. - This is basically a backwards-compatibility shim for the CLI. - Parameters:
- storage – instantiation config for the storage 
- overrides – A dict of extra configuration for loaders. Maps fully qualified class names (e.g. "swh.loader.git.loader.GitLoader") to a dict of extra keyword arguments to pass to this (and only this) loader.
- extra_kwargs – all extra keyword arguments are passed to all loaders 
 
- Returns:
- the instantiated loader 
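The documented override semantics (extra keyword arguments go to all loaders, overrides only to the named class, with overrides winning) can be sketched with plain dicts. This is only an illustration; `resolve_loader_kwargs` is a hypothetical helper, not part of the API, and the merge order is an assumption:

```python
# Hypothetical helper illustrating the documented from_config() semantics:
# shared extra kwargs apply to all loaders, overrides only to the named class.

def resolve_loader_kwargs(fqcn, overrides, extra_kwargs):
    """Merge shared kwargs with the overrides for one fully qualified
    class name; the per-class overrides take precedence."""
    kwargs = dict(extra_kwargs)
    kwargs.update((overrides or {}).get(fqcn, {}))
    return kwargs

common = {"max_content_size": 10_000_000}
overrides = {"swh.loader.git.loader.GitLoader": {"max_content_size": None}}

# Only the GitLoader entry is affected by the override:
git_kwargs = resolve_loader_kwargs("swh.loader.git.loader.GitLoader", overrides, common)
svn_kwargs = resolve_loader_kwargs("swh.loader.svn.loader.SvnLoader", overrides, common)
```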
 
 - classmethod from_configfile(**kwargs: Any)[source]#
- Instantiate a loader from the configuration loaded from the SWH_CONFIG_FILENAME envvar, with potential extra keyword arguments if their value is not None. - Parameters:
- kwargs – kwargs passed to the loader instantiation 
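The documented behaviour (configuration read from the file named by the SWH_CONFIG_FILENAME environment variable, extra keyword arguments forwarded only when their value is not None) can be sketched as follows; `build_loader_kwargs` and the merge order are assumptions for illustration only:

```python
import os

def build_loader_kwargs(file_config, **kwargs):
    """Merge file-based configuration with explicit keyword arguments,
    dropping the ones whose value is None, as from_configfile() documents."""
    merged = dict(file_config)
    merged.update({k: v for k, v in kwargs.items() if v is not None})
    return merged

# The real classmethod reads its configuration from the file named here:
config_path = os.environ.get("SWH_CONFIG_FILENAME")

merged = build_loader_kwargs(
    {"origin_url": "https://example.org/project.git"},
    max_content_size=None,   # None values are filtered out
    lister_name="github",    # non-None values extend/override the config
)
```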
 
 - flush() Dict[str, int][source]#
- Flush any potential buffered data not sent to swh-storage. Returns the same value as swh.storage.interface.StorageInterface.flush().
 - prepare() None[source]#
- Second step executed by the loader to prepare some state needed by the loader.
- Raises:
- NotFound exception if the origin to ingest is not found.
 
 - get_origin() Origin[source]#
- Get the origin that is currently being loaded. self.origin should be set in prepare_origin() - Returns:
- an origin ready to be sent to storage by origin_add().
- Return type:
- Origin
 - fetch_data() bool[source]#
- Fetch the data from the source the loader is currently loading (e.g. a git/hg/svn/… repository).
 - Returns:
- a value that is interpreted as a boolean. If True, fetch_data() needs to be called again to complete loading.
 
 - process_data() bool[source]#
- Run any additional processing between fetching and storing the data. - Returns:
- a value that is interpreted as a boolean. If True, fetch_data() needs to be called again to complete loading. Ignored if fetch_data() already returned False.
 
 - store_data() None[source]#
- Store fetched and processed data in the storage. - This should call the storage.<object>_add methods, which handle the objects to store in the storage. 
 - load_status() Dict[str, str][source]#
- Detailed loading status. - Defaults to logging an eventful load. - Returns: a dictionary that is eventually passed back as the task’s result to the scheduler, allowing tuning of the task recurrence mechanism.
 
 - post_load(success: bool = True) None[source]#
- Permit the loader to perform additional actions after the loading is done, according to its status. The success flag indicates the loading’s status. - Defaults to doing nothing. - It is up to the implementer of this method to make sure this does not break. - Parameters:
- success (bool) – the success status of the loading 
 
 - pre_cleanup() None[source]#
- As a first step, this tries to check for dangling data to clean up. It should do its best to avoid raising issues.
 - build_partial_snapshot() Snapshot | None[source]#
- When the loader is configured to serialize partial snapshots, this allows the loader to provide an implementation that builds a partial snapshot. This is used when the ingestion takes multiple calls to fetch_data() and store_data(). Ignored when the loader is not configured to serialize partial snapshots.
 - load() Dict[str, str][source]#
- Loading logic for the loader to follow:
- Store the actual origin_visit to storage
- Call prepare() to prepare any state needed by the loader
- Call get_origin() to get the origin we work with and store
- while True:
- Call fetch_data() to fetch the data to store
- Call process_data() to optionally run processing between fetch_data() and store_data()
- Call store_data() to store the data
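The loop above can be sketched with stub methods. This toy class only illustrates the documented control flow, in particular that process_data()'s return value is combined with fetch_data()'s; it is not the real BaseLoader implementation:

```python
class ToyLoader:
    """Toy stand-in showing the documented fetch/process/store loop."""

    def __init__(self, batches):
        self.batches = list(batches)  # data still to fetch, one batch per call
        self.current = None
        self.stored = []

    def fetch_data(self):
        # True means fetch_data() must be called again to complete loading
        self.current = self.batches.pop(0)
        return bool(self.batches)

    def process_data(self):
        # optional processing between fetching and storing
        self.current = [item.upper() for item in self.current]
        return bool(self.batches)

    def store_data(self):
        self.stored.extend(self.current)

    def load(self):
        while True:
            more = self.fetch_data()
            # process_data()'s value is and-ed in, so it is effectively
            # ignored once fetch_data() already returned False
            more = self.process_data() and more
            self.store_data()
            if not more:
                break
        return {"status": "eventful" if self.stored else "uneventful"}

loader = ToyLoader([["a"], ["b"]])
result = loader.load()
```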
 
 
 - load_metadata_objects(metadata_objects: List[RawExtrinsicMetadata]) None[source]#
 - build_extrinsic_origin_metadata() List[RawExtrinsicMetadata][source]#
- Builds a list of full RawExtrinsicMetadata objects, using a metadata fetcher returned by get_fetcher_classes().
 - statsd_timed(name: str, tags: Dict[str, Any] = {}) ContextManager[source]#
- Wrapper for swh.core.statsd.Statsd.timed(), which uses the standard metric name and tags for loaders.
 
- class swh.loader.core.loader.NodeLoader(storage: StorageInterface, url: str, checksums: Dict[str, str], checksums_computation: str | None = None, checksum_layout: str | None = None, fallback_urls: List[str] | None = None, **kwargs)[source]#
- Bases: BaseLoader, ABC
- Common abstract class for ContentLoader and directory loaders.
- The “checksums” field is a dictionary of hex hashes of the object retrieved (content or directory). When “checksum_layout” is “standard”, the checksums are computed on the content of the remote file itself (as the unix CLIs “sha1sum”, “sha256sum”, … do). When “checksum_layout” is “nar”, the check is delegated to the Nar class (which performs a hash computation equivalent to the nix-store --dump CLI); it is still a checksum on the content of the remote artifact retrieved (be it a file or an archive). Any other “checksum_layout” raises UnsupportedChecksumLayout.
- The multiple “fallback” urls received are mirror urls, only used to fetch the object if the main origin is no longer available. They are not stored.
- Ingestion is considered eventful on the first ingestion. Subsequent loads of the same object should end up as uneventful visits (matching snapshot).
- extid_version = 1#
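For the “standard” layout, the check amounts to hashing the raw bytes of the fetched file, exactly as sha1sum/sha256sum would. A stdlib-only sketch (the helper name is illustrative, not the actual API):

```python
import hashlib

def compute_standard_checksums(data, algos=("sha1", "sha256")):
    """Hash the raw bytes of a fetched artifact, like `sha1sum` and
    `sha256sum` do ("standard" layout). The "nar" layout instead hashes
    a serialized form of the object, like `nix-store --dump`."""
    return {algo: hashlib.new(algo, data).hexdigest() for algo in algos}

checksums = compute_standard_checksums(b"hello")
# A loader would compare this dict against the configured "checksums"
# field and treat any mismatch as a ValueError.
```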
 - prepare() None[source]#
- Second step executed by the loader to prepare some state needed by the loader.
- Raises:
- NotFound exception if the origin to ingest is not found.
 
 - load_status() Dict[str, Any][source]#
- Detailed loading status. - Defaults to logging an eventful load. - Returns: a dictionary that is eventually passed back as the task’s result to the scheduler, allowing tuning of the task recurrence mechanism.
 
 - abstract fetch_artifact() Iterator[Path][source]#
- This fetches an artifact representation and yields its associated local representation (as a Path). Depending on the implementation, this may yield contents coming from a remote location, or directories coming from a tarball, svn tree, git tree, hg tree, … - Raises:
- NotFound if nothing is found; ValueError in case of mismatched checksums
 
 - abstract process_artifact(artifact_path: Path) None[source]#
- Build the DAG objects out of the locally retrieved artifact. 
 - fetch_data() bool[source]#
- Fetch the artifact (e.g. content, directory), check it, and ingest the DAG objects coming from the artifact. - This iterates over the generator fetch_artifact() to retrieve artifacts. As soon as one is retrieved and passes the checks (e.g. nar checks when the “checksum_layout” is “nar”), the method proceeds with the DAG ingestion as usual. If the artifact does not pass the checks, this tries to retrieve the next mirrored artifact. If no artifact is retrievable, this raises. - Raises:
- NotFound if no artifact is found; ValueError in case of mismatched checksums
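The mirror-fallback behaviour can be sketched as follows: try each candidate in turn, keep the first whose checksum matches, and raise when all candidates fail. The helper names are illustrative, not the actual implementation:

```python
import hashlib

def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()

def first_valid_artifact(candidates, expected_sha256):
    """Return the first candidate whose sha256 matches; raise ValueError
    if every candidate (main origin, then mirrors) fails the check."""
    for data in candidates:
        if sha256_hex(data) == expected_sha256:
            return data
    raise ValueError("mismatched checksums on all candidate artifacts")

good = b"artifact-bytes"
# A corrupted main download, then a valid mirror copy:
result = first_valid_artifact([b"corrupted", good], sha256_hex(good))
```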
 
 
- class swh.loader.core.loader.ContentLoader(*args, **kwargs)[source]#
- Bases: NodeLoader
- Basic loader for the edge case of ingesting a url resolving to a bare ‘content’ file.
- A visit ends up as a full visit with a snapshot when the artifact is retrieved with success, matches the checksums provided, and is ingested with success in the archive.
- An extid mapping entry is recorded in the extid table. The extid_type depends on the checksums’ type provided (see NodeLoader docstring).
- The output snapshot has the following structure:

  id: <bytes>
  branches:
    HEAD:
      target_type: content
      target: <content-id>

- fetch_artifact() Iterator[Path][source]#
- Iterates over the mirror urls to find a content. - Raises:
- NotFound if nothing is found; ValueError in case of any error when fetching/computing (length, checksums mismatched, …)
 
 
- class swh.loader.core.loader.BaseDirectoryLoader(*args, path_filter: ~typing.Callable[[bytes, bytes, ~typing.Iterable[bytes] | None], bool] = <function accept_all_paths>, **kwargs)[source]#
- Bases: NodeLoader
- Abstract base directory loader for ‘tree’ ingestion (through any media).
- Implementations should inherit from this class and provide the:
- required fetch_artifact() method to retrieve the Directory (from the proper media protocol, e.g. git, svn, hg, …)
- optional build_snapshot() method to build the Snapshot with the proper structure if the default is not enough.
 - process_artifact(artifact_path: Path) None[source]#
- Build the Directory and other DAG objects out of the remote artifact retrieved (self.artifact_path). - This needs to happen in this method because it’s within a context manager block. 
 - build_snapshot() Snapshot[source]#
- Build and return the snapshot to store in the archive. - By default, this builds the snapshot with the structure:

  id: <bytes>
  branches:
    HEAD:
      target_type: directory
      target: <directory-id>

- Other directory loader implementations can override this method to build a more specific snapshot.
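The default shape can be sketched as a plain mapping; the real method returns a swh.model Snapshot object, so this dict is only an illustration of the documented structure:

```python
def default_snapshot_shape(directory_id):
    """Plain-dict illustration of the default snapshot built above:
    a single HEAD branch targeting the ingested directory."""
    return {
        "branches": {
            b"HEAD": {
                "target_type": "directory",
                "target": directory_id,  # id of the ingested Directory
            }
        }
    }

snap = default_snapshot_shape(b"\x01" * 20)
```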
 
- class swh.loader.core.loader.TarballDirectoryLoader(*args, path_filter: ~typing.Callable[[bytes, bytes, ~typing.Iterable[bytes] | None], bool] = <function accept_all_paths>, **kwargs)[source]#
- Bases: BaseDirectoryLoader
- TarballDirectoryLoader for ingestion of a url resolving to a tarball. The tarball is uncompressed and checked against its provided checksums (either standard checksums or Nar checksums).
- A visit ends up as a full visit with a snapshot when the artifact is retrieved with success, matches the checksums provided, and is ingested with success in the archive.
- An extid mapping entry is recorded in the extid table. The extid_type depends on the checksums’ type provided (see NodeLoader docstring).
- The output snapshot has the following structure:

  id: <bytes>
  branches:
    HEAD:
      target_type: directory
      target: <directory-id>