swh.indexer.storage.api.client module#
- class swh.indexer.storage.api.client.RemoteStorage(url: str, timeout: None | Tuple[float, float] | List[float] | float = None, chunk_size: int = 4096, max_retries: int = 3, pool_connections: int = 20, pool_maxsize: int = 100, adapter_kwargs: Dict[str, Any] | None = None, api_exception: Type[Exception] | None = None, reraise_exceptions: List[Type[Exception]] | None = None, enable_requests_retry: bool | None = None, **kwargs)#
- Bases: RPCClient
- Proxy to a remote storage API.
 - backend_class#
- alias of IndexerStorageInterface
 - api_exception#
- alias of IndexerStorageAPIError
 - reraise_exceptions: List[Type[Exception]] = [<class 'swh.indexer.storage.exc.IndexerStorageArgumentException'>, <class 'swh.indexer.storage.exc.DuplicateId'>]#
- On server errors, if any of the exception classes in this list has the same name as the error name, then the exception will be instantiated and raised instead of a generic RemoteException. 
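The name-based matching described above can be illustrated with a minimal sketch. It is not the RPCClient implementation: the two exception classes are local stand-ins for the ones in swh.indexer.storage.exc, and find_reraise_class is a hypothetical helper introduced only for this example.

```python
from typing import List, Optional, Type


class IndexerStorageArgumentException(Exception):
    """Local stand-in for swh.indexer.storage.exc.IndexerStorageArgumentException."""


class DuplicateId(Exception):
    """Local stand-in for swh.indexer.storage.exc.DuplicateId."""


def find_reraise_class(
    error_name: str, reraise_exceptions: List[Type[Exception]]
) -> Optional[Type[Exception]]:
    """Return the first class whose __name__ equals error_name, else None.

    Hypothetical helper: it only mirrors the documented rule that the
    error name is compared against the class names in the list.
    """
    for exc_class in reraise_exceptions:
        if exc_class.__name__ == error_name:
            return exc_class
    return None


reraise = [IndexerStorageArgumentException, DuplicateId]

# A server error named "DuplicateId" maps to the specific class ...
matched = find_reraise_class("DuplicateId", reraise)
# ... while an unlisted name has no match, so a generic error would be raised.
unmatched = find_reraise_class("ValueError", reraise)
```

Under this reading, client code can catch DuplicateId or IndexerStorageArgumentException directly, and should still be prepared for a generic remote exception for any other server error.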
 - extra_type_decoders: Dict[str, Callable] = {'idx_model': <function <lambda>>}#
- Value of extra_decoders passed to json_loads or msgpack_loads to be able to deserialize more object types. 
 - extra_type_encoders: List[Tuple[type, str, Callable]] = [(<class 'swh.indexer.storage.model.BaseRow'>, 'idx_model', <function _encode_model_object>)]#
- Value of extra_encoders passed to json_dumps or msgpack_dumps to be able to serialize more object types. 
 - check_config(*, check_write)#
- Check that the storage is configured and ready to go. 
 - content_fossology_license_add(licenses: List[ContentLicenseRow]) Dict[str, int]#
- Add licenses not present in storage. - Parameters:
- licenses – license rows to be added, with their tool attribute set to None. 
 
- Returns:
- Dict summary of number of rows added 
 
 - content_fossology_license_get(ids: Iterable[bytes]) List[ContentLicenseRow]#
- Retrieve licenses per id. - Parameters:
- ids – sha1 identifiers 
- Yields:
- license rows; possibly more than one per (sha1, tool_id) if there are multiple licenses. 
 
 - content_fossology_license_get_partition(indexer_configuration_id: int, partition_id: int, nb_partitions: int, page_token: str | None = None, limit: int = 1000) PagedResult[bytes, str]#
- Retrieve licenses within the partition partition_id bound by limit. - Parameters:
- indexer_configuration_id – the tool used to index data 
- partition_id – index of the partition to fetch 
- nb_partitions – total number of partitions to split into 
- page_token – opaque token used for pagination 
- limit – maximum number of results (defaults to 1000) 
 
- Raises:
- IndexerStorageArgumentException – if limit is set to None, or if a wrong indexer_type is provided 
 
- Returns:
- PagedResult of Sha1. If next_page_token is None, there is no more data to fetch 
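The pagination contract above (keep requesting pages until next_page_token is None) can be sketched as a small generator. This is an illustration under assumptions: FakePagedResult and FakeStorage are hypothetical in-memory stand-ins so the example runs without a server, the results attribute name is assumed, and iter_partition is a helper invented for this example.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class FakePagedResult:
    """Stand-in for PagedResult: one page of results plus an optional token."""

    results: List[bytes]
    next_page_token: Optional[str] = None


class FakeStorage:
    """Hypothetical stand-in for RemoteStorage; ignores partitioning."""

    def __init__(self, sha1s: List[bytes]) -> None:
        self._sha1s = sha1s

    def content_fossology_license_get_partition(
        self,
        indexer_configuration_id: int,
        partition_id: int,
        nb_partitions: int,
        page_token: Optional[str] = None,
        limit: int = 1000,
    ) -> FakePagedResult:
        # The token is opaque to callers; here it is simply an offset.
        start = int(page_token) if page_token else 0
        end = start + limit
        token = str(end) if end < len(self._sha1s) else None
        return FakePagedResult(self._sha1s[start:end], token)


def iter_partition(
    storage, tool_id: int, partition_id: int, nb_partitions: int, limit: int = 1000
) -> Iterator[bytes]:
    """Yield every sha1 of one partition, following page tokens to the end."""
    page_token = None
    while True:
        page = storage.content_fossology_license_get_partition(
            tool_id, partition_id, nb_partitions, page_token=page_token, limit=limit
        )
        yield from page.results
        page_token = page.next_page_token
        if page_token is None:  # no more data to fetch
            break


storage = FakeStorage([bytes([i]) * 20 for i in range(5)])
sha1s = list(iter_partition(storage, tool_id=1, partition_id=0, nb_partitions=1, limit=2))
```

The same loop shape should apply to content_mimetype_get_partition, which takes the same parameters.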
 
 - content_metadata_add(metadata: List[ContentMetadataRow]) Dict[str, int]#
- Add metadata not present in storage. - Parameters:
- metadata (iterable) – dictionaries with keys:
- id: sha1 
- metadata: arbitrary dict 
 
- Returns:
- Dict summary of number of rows added 
 
 - content_metadata_get(ids: Iterable[bytes]) List[ContentMetadataRow]#
- Retrieve metadata per id. - Parameters:
- ids (iterable) – sha1 checksums 
- Yields:
- dictionaries with the following keys:
- id (bytes) 
- metadata (str): associated metadata 
- tool (dict): tool used to compute metadata 
 
 - content_metadata_missing(metadata: Iterable[Dict]) List[Tuple[bytes, int]]#
- List metadata missing from storage. - Parameters:
- metadata (iterable) – dictionaries with keys:
- id (bytes): sha1 identifier 
- indexer_configuration_id (int): tool used to compute the results 
 
- Yields:
- missing sha1s 
 
 - content_mimetype_add(mimetypes: List[ContentMimetypeRow]) Dict[str, int]#
- Add mimetypes not present in storage. - Parameters:
- mimetypes – mimetype rows to be added, with their tool attribute set to None. 
- overwrite ( - True)
- default) 
 
- Returns:
- Dict summary of number of rows added 
 
 - content_mimetype_get(ids: Iterable[bytes]) List[ContentMimetypeRow]#
- Retrieve full content mimetype per ids. - Parameters:
- ids – sha1 identifiers 
- Returns:
- mimetype row objects 
 
 - content_mimetype_get_partition(indexer_configuration_id: int, partition_id: int, nb_partitions: int, page_token: str | None = None, limit: int = 1000) PagedResult[bytes, str]#
- Retrieve mimetypes within partition partition_id bound by limit. - Parameters:
- indexer_configuration_id – the tool used to index data 
- partition_id – index of the partition to fetch 
- nb_partitions – total number of partitions to split into 
- page_token – opaque token used for pagination 
- limit – maximum number of results (defaults to 1000) 
 
- Raises:
- IndexerStorageArgumentException – if limit is set to None, or if a wrong indexer_type is provided 
 
- Returns:
- PagedResult of Sha1. If next_page_token is None, there is no more data to fetch 
 
 - content_mimetype_missing(mimetypes: Iterable[Dict]) List[Tuple[bytes, int]]#
- List mimetypes missing from storage. - Parameters:
- mimetypes (iterable) – dictionaries with keys:
- id (bytes): sha1 identifier 
- indexer_configuration_id (int): tool used to compute the results 
 
- Returns:
- list of tuple (id, indexer_configuration_id) missing 
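A typical use of the *_missing methods is to filter a batch of sha1s down to those a given tool has not indexed yet. The sketch below shows the query shape described above (one dict per sha1, with id and indexer_configuration_id keys). FakeIndexerStorage is a hypothetical in-memory stand-in and sha1s_to_index is a helper invented for this example; neither is part of swh.indexer.

```python
from typing import Dict, Iterable, List, Tuple


class FakeIndexerStorage:
    """Hypothetical in-memory stand-in for the remote storage."""

    def __init__(self, indexed: Iterable[Tuple[bytes, int]]) -> None:
        # Set of (sha1, indexer_configuration_id) pairs already indexed.
        self._indexed = set(indexed)

    def content_mimetype_missing(
        self, mimetypes: Iterable[Dict]
    ) -> List[Tuple[bytes, int]]:
        pairs = [(m["id"], m["indexer_configuration_id"]) for m in mimetypes]
        return [pair for pair in pairs if pair not in self._indexed]


def sha1s_to_index(storage, sha1s: Iterable[bytes], tool_id: int) -> List[bytes]:
    """Return the sha1s for which the given tool has no stored mimetype."""
    query = [{"id": sha1, "indexer_configuration_id": tool_id} for sha1 in sha1s]
    return [sha1 for sha1, _tool_id in storage.content_mimetype_missing(query)]


known = b"\x01" * 20
unknown = b"\x02" * 20
storage = FakeIndexerStorage(indexed=[(known, 1)])

# Only the sha1 that tool 1 has not processed remains.
todo = sha1s_to_index(storage, [known, unknown], tool_id=1)
```

Because the lookup is keyed on the (id, indexer_configuration_id) pair, a sha1 indexed by one tool is still reported as missing for another.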
 
 - directory_intrinsic_metadata_add(metadata: List[DirectoryIntrinsicMetadataRow]) Dict[str, int]#
- Add metadata not present in storage. - Parameters:
- metadata – DirectoryIntrinsicMetadataRow objects 
- Returns:
- Dict summary of number of rows added 
 
 - directory_intrinsic_metadata_get(ids: Iterable[bytes]) List[DirectoryIntrinsicMetadataRow]#
- Retrieve directory metadata per id. - Parameters:
- ids (iterable) – sha1 checksums 
- Returns:
- DirectoryIntrinsicMetadataRow objects 
 
 - directory_intrinsic_metadata_missing(metadata: Iterable[Dict]) List[Tuple[bytes, int]]#
- List metadata missing from storage. - Parameters:
- metadata (iterable) – dictionaries with keys:
- id (bytes): sha1_git directory identifier 
- indexer_configuration_id (int): tool used to compute the results 
 
- Returns:
- missing ids 
 
 - indexer_configuration_add(tools)#
- Add new tools to the storage. - Parameters:
- tools ([dict]) – list of dictionaries, each representing a tool to insert in the db, with the following keys:
- tool_name (str): tool’s name 
- tool_version (str): tool’s version 
- tool_configuration (dict): tool’s configuration (free form dict) 
 
- Returns:
- List of dict inserted in the db (holding the id key as well). The order of the list is not guaranteed to match the order of the initial list. 
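Since the returned list may not follow the input order, callers that need the id of each requested tool should match on the tool's fields rather than on position. A minimal sketch, assuming the returned dicts carry the same three keys plus id; tool_key and ids_for_requested are hypothetical helpers, and the tool names, versions, and ids are made-up sample values standing in for a real response.

```python
import json
from typing import Any, Dict, List, Tuple


def tool_key(tool: Dict[str, Any]) -> Tuple[str, str, str]:
    """Build a hashable key from the three fields that identify a tool."""
    return (
        tool["tool_name"],
        tool["tool_version"],
        # Dicts are not hashable, so serialize the configuration canonically.
        json.dumps(tool["tool_configuration"], sort_keys=True),
    )


def ids_for_requested(
    requested: List[Dict[str, Any]], inserted: List[Dict[str, Any]]
) -> List[int]:
    """Return the ids of the requested tools, in the order they were requested."""
    ids_by_key = {tool_key(tool): tool["id"] for tool in inserted}
    return [ids_by_key[tool_key(tool)] for tool in requested]


requested = [
    {"tool_name": "tool-a", "tool_version": "1.0", "tool_configuration": {"x": 1}},
    {"tool_name": "tool-b", "tool_version": "2.0", "tool_configuration": {}},
]

# Hand-written stand-in for what indexer_configuration_add might return:
# the same tools with an id key, deliberately in a different order.
inserted = [
    {"id": 7, "tool_name": "tool-b", "tool_version": "2.0", "tool_configuration": {}},
    {"id": 3, "tool_name": "tool-a", "tool_version": "1.0", "tool_configuration": {"x": 1}},
]

ids = ids_for_requested(requested, inserted)
```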
 
 - indexer_configuration_get(tool)#
- Retrieve tool information. - Parameters:
- tool (dict) – dictionary representing a tool, with the following keys:
- tool_name (str): tool’s name 
- tool_version (str): tool’s version 
- tool_configuration (dict): tool’s configuration (free form dict) 
 
- Returns:
- The same dictionary with an added id key if the tool is found, None otherwise. 
 
 - origin_extrinsic_metadata_add(metadata: List[OriginExtrinsicMetadataRow]) Dict[str, int]#
- Add origin metadata not present in storage. - Parameters:
- metadata – list of OriginExtrinsicMetadataRow objects 
- Returns:
- Dict summary of number of rows added 
 
 - origin_extrinsic_metadata_get(urls: Iterable[str]) List[OriginExtrinsicMetadataRow]#
- Retrieve origin metadata per origin URL. - Parameters:
- urls (iterable) – origin URLs 
- Returns:
- list of OriginExtrinsicMetadataRow 
 - origin_intrinsic_metadata_add(metadata: List[OriginIntrinsicMetadataRow]) Dict[str, int]#
- Add origin metadata not present in storage. - Parameters:
- metadata – list of OriginIntrinsicMetadataRow objects 
- Returns:
- Dict summary of number of rows added 
 
 - origin_intrinsic_metadata_get(urls: Iterable[str]) List[OriginIntrinsicMetadataRow]#
- Retrieve origin metadata per origin URL. - Parameters:
- urls (iterable) – origin URLs 
- Returns:
- list of OriginIntrinsicMetadataRow 
 - origin_intrinsic_metadata_search_by_producer(page_token: str = '', limit: int = 100, ids_only: bool = False, mappings: List[str] | None = None, tool_ids: List[int] | None = None) PagedResult[str | OriginIntrinsicMetadataRow, str]#
- Returns origins that have intrinsic metadata, optionally filtered by the mappings or tools that produced them. - Parameters:
- page_token (str) – Opaque token used for pagination. 
- limit (int) – The maximum number of results to return 
- ids_only (bool) – Determines whether only origin urls are returned or the content as well 
- mappings (List[str]) – Returns origins whose intrinsic metadata were generated using at least one of these mappings. 
 
- Returns:
- OriginIntrinsicMetadataRow objects 
 
 - origin_intrinsic_metadata_search_fulltext(conjunction: List[str], limit: int = 100) List[OriginIntrinsicMetadataRow]#
- Returns the list of origins whose metadata contain all the terms. - Parameters:
- conjunction – List of terms to be searched for. 
- limit – The maximum number of results to return 
 
- Returns:
- list of OriginIntrinsicMetadataRow 
 
 - origin_intrinsic_metadata_stats()#
- Returns counts of indexed metadata per origin, broken down by metadata type. - Returns:
- dictionary with keys:
- total (int): total number of origins that were indexed (possibly yielding an empty metadata dictionary) 
- non_empty (int): total number of origins that we extracted a non-empty metadata dictionary from 
- per_mapping (dict): a dictionary with mapping names as keys and number of origins whose indexing used this mapping. Note that indexing a given origin may use 0, 1, or many mappings. 
 
- Return type: