swh.web.utils.archive module#
- swh.web.utils.archive.lookup_multiple_hashes(hashes)[source]#
- Lookup the passed hashes in a single DB connection, using batch processing. - Parameters:
- hashes – An array of {filename: X, sha1: Y}, where X is a string and Y is a hex sha1 string. 
- Returns:
- The same array with elements updated with elem[‘found’] = true if the hash is present in storage, elem[‘found’] = false if not. 
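The input and output shapes described above can be sketched as follows. This is only an illustration of the data flow, not the archive's implementation: `mark_found` and the in-memory `known_sha1s` set are hypothetical stand-ins for the batched storage lookup.

```python
from typing import Dict, List, Set


def mark_found(hashes: List[Dict], known_sha1s: Set[str]) -> List[Dict]:
    """Hypothetical stand-in: flag each element with elem['found']."""
    for elem in hashes:
        elem["found"] = elem["sha1"] in known_sha1s
    return hashes


# Toy "storage": a set of hex sha1 strings assumed to be archived.
known_sha1s = {"a" * 40}

hashes = [
    {"filename": "README", "sha1": "a" * 40},
    {"filename": "setup.py", "sha1": "b" * 40},
]
result = mark_found(hashes, known_sha1s)
```

The same list object is returned, with each element updated in place.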
 
- swh.web.utils.archive.lookup_hash(q: str) Dict[str, Any][source]#
- Check if the storage contains a given content checksum and return it if found. - Parameters:
- q – query string of the form <hash_algo:hash> 
- Returns:
- Dict with key found containing the hash info if the hash is present, None if not. 
- swh.web.utils.archive.search_hash(q: str) Dict[str, bool][source]#
- Search storage for a given content checksum. - Parameters:
- q – query string of the form <hash_algo:hash> 
- Returns:
- Dict with key found set to True or False, according to whether the checksum is present or not 
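Several functions in this module take a query string of the form <hash_algo:hash>. A minimal sketch of how such a string might be built and split is shown below. The helper `split_hash_query` and the `ASSUMED_ALGOS` set are hypothetical and exist only for this example; the module's real parsing and accepted algorithm names may differ.

```python
import hashlib
from typing import Tuple

# Assumed set of algorithm names, for illustration only.
ASSUMED_ALGOS = {"sha1", "sha1_git", "sha256", "blake2s256"}


def split_hash_query(q: str) -> Tuple[str, str]:
    """Hypothetical helper: split '<hash_algo:hash>' into (algo, hex_hash)."""
    algo, sep, hex_hash = q.partition(":")
    if not sep or algo not in ASSUMED_ALGOS:
        raise ValueError(f"expected '<hash_algo:hash>', got {q!r}")
    bytes.fromhex(hex_hash)  # raises ValueError if not valid hexadecimal
    return algo, hex_hash


# Build a query string from the sha1 checksum of some bytes.
data = b"hello world\n"
q = "sha1:" + hashlib.sha1(data).hexdigest()
algo, hex_hash = split_hash_query(q)
```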
 
- swh.web.utils.archive.lookup_content_filetype(q)[source]#
- Return filetype information from a specified content. - Parameters:
- q – query string of the form <hash_algo:hash> 
- Yields:
- filetype information (dict) list if the content is found. 
 
- swh.web.utils.archive.lookup_content_language(q)[source]#
- Always returns None. - This used to return language information from a specified content, but this is currently disabled. - Parameters:
- q – query string of the form <hash_algo:hash> 
- Yields:
- language information (dict) list if the content is found. 
 
- swh.web.utils.archive.lookup_content_license(q)[source]#
- Return license information from a specified content. - Parameters:
- q – query string of the form <hash_algo:hash> 
- Yields:
- license information (dict) list if the content is found. 
 
- swh.web.utils.archive.lookup_origin(origin_url: str, lookup_similar_urls: bool = True) OriginInfo[source]#
- Return information about the origin with the given URL. - Parameters:
- origin_url – URL of origin 
- lookup_similar_urls – if True, look up the origin with and without a trailing slash in its URL
 
- Returns:
- origin information as dict. 
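The lookup_similar_urls behaviour (trying the URL with and without a trailing slash) can be pictured with the sketch below. `url_variants` is a hypothetical helper written for this example; it only shows which candidate URLs such a lookup could try, not how the module queries storage.

```python
from typing import List


def url_variants(origin_url: str) -> List[str]:
    """Hypothetical helper: the URL as given, then its trailing-slash twin."""
    if origin_url.endswith("/"):
        twin = origin_url.rstrip("/")
    else:
        twin = origin_url + "/"
    return [origin_url, twin]


candidates = url_variants("https://example.org/repo")
```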
 
- swh.web.utils.archive.lookup_origins(page_token: str | None, limit: int = 100) PagedResult[OriginInfo, str][source]#
- Get list of archived software origins in a paginated way. - Origins are sorted by id before returning them 
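Paginated listing with a page token is typically consumed in a loop. The sketch below runs against a local stub rather than the archive: `Page`, `fake_lookup_origins` and `ORIGINS` are hypothetical stand-ins, and the attribute names `results` and `next_page_token` are assumptions about the shape of the paged result.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Page:
    """Stand-in for a paged result: some results plus an optional next token."""

    results: List[str]
    next_page_token: Optional[str]


ORIGINS = [f"https://example.org/repo{i}" for i in range(5)]


def fake_lookup_origins(page_token: Optional[str], limit: int = 100) -> Page:
    """Hypothetical stub paging over ORIGINS; the token is a list offset."""
    start = int(page_token) if page_token else 0
    end = start + limit
    token = str(end) if end < len(ORIGINS) else None
    return Page(ORIGINS[start:end], token)


def iter_all_origins(limit: int = 2) -> Iterator[str]:
    """Follow next_page_token until the stub reports no further page."""
    page_token = None
    while True:
        page = fake_lookup_origins(page_token, limit=limit)
        yield from page.results
        if page.next_page_token is None:
            break
        page_token = page.next_page_token


all_origins = list(iter_all_origins())
```

Callers should treat the token as opaque; the integer offset used here is an artifact of the stub.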
- swh.web.utils.archive.lookup_origin_snapshots(origin: OriginInfo) List[str][source]#
- Return ids of the snapshots of an origin. - Parameters:
- origin – origin’s dict with ‘url’ key 
- Returns:
- List of unique snapshot identifiers in hexadecimal format resulting from the visits of the origin. 
 
- swh.web.utils.archive.search_origin(url_pattern: str, use_ql: bool = False, limit: int = 50, with_visit: bool = False, visit_types: List[str] | None = None, page_token: str | None = None) Tuple[List[OriginInfo], str | None][source]#
- Search for origins whose urls contain a provided string pattern or match a provided regular expression. - Parameters:
- url_pattern – the string pattern to search for in origin urls 
- use_ql – whether to use swh search query language or not 
- limit – the maximum number of found origins to return 
- with_visit – Whether origins with no visit are to be filtered out 
- visit_types – Only origins having any of the provided visit types (e.g. git, svn, pypi) will be returned 
- page_token – opaque string used to get the next results of a search 
 
- Returns:
- list of origin information as dict. 
 
- swh.web.utils.archive.search_origin_metadata(fulltext: str, limit: int = 50, return_metadata: bool = True) Iterable[OriginMetadataInfo][source]#
- Search for origins whose metadata match a provided string pattern. - Parameters:
- fulltext – the string pattern to search for in origin metadata 
- limit – the maximum number of found origins to return 
- return_metadata – if false, will only return the origin URL 
 
- Returns:
- Iterable of origin metadata information for existing origins 
 
- swh.web.utils.archive.lookup_origin_intrinsic_metadata(origin_url: str, lookup_similar_urls: bool = True) list[Dict[str, Any]][source]#
- Return intrinsic metadata for the given origin (as a JSON-LD/CodeMeta dictionary). - Parameters:
- origin_url – origin url 
- lookup_similar_urls – if True, look up the origin with and without a trailing slash in its URL
 
- Raises:
- swh.web.utils.exc.NotFoundExc – when the origin is not found 
- Returns:
- origin metadata. 
 
- swh.web.utils.archive.lookup_origin_intrinsic_citation_metadata(origin_url: str, lookup_similar_urls: bool = True) List[IntrinsicMetadataFile][source]#
- Get raw intrinsic metadata given a software origin, respectively original codemeta.json and citation.cff, for the latest visit snapshot main branch root directory. - Parameters:
- origin_url – origin url 
- lookup_similar_urls – if True, look up the origin with and without a trailing slash in its URL
 
- Returns:
- list of intrinsic metadata files info 
- Raises:
- swh.web.utils.exc.NotFoundExc – when snapshot, branch or directory is missing or no metadata could be found 
- BadInputExc – when the metadata files could not be decoded 
 
 
- swh.web.utils.archive.lookup_intrinsic_citation_metadata_by_target_swhid(target_swhid: str) List[IntrinsicMetadataFile][source]#
- Get raw intrinsic metadata given a SWHID, respectively original codemeta.json and citation.cff, for the target object. If the target object is of type Snapshot, get metadata from the main branch (HEAD). - Parameters:
- target_swhid – SWHID, qualified or not; if the target object is of type Content, it must be qualified with an anchor.
- Returns:
- list of intrinsic metadata files info 
- Raises:
- swh.web.utils.exc.NotFoundExc – when the target object is missing or no metadata could be found 
- BadInputExc – when the metadata files could not be decoded 
 
 
- swh.web.utils.archive.lookup_origin_extrinsic_metadata(origin_url: str, lookup_similar_urls: bool = True) list[Dict[str, Any]][source]#
- Return extrinsic metadata for the given origin (as a JSON-LD/CodeMeta dictionary). - Parameters:
- origin_url – origin url 
- lookup_similar_urls – if True, look up the origin with and without a trailing slash in its URL
 
- Raises:
- swh.web.utils.exc.NotFoundExc – when the origin is not found 
- Returns:
- origin metadata. 
 
- swh.web.utils.archive.directory_exists(sha1_git: str) bool[source]#
- Checks if a directory can be found in the archive. - Parameters:
- sha1_git – directory identifier 
- Returns:
- whether the directory exists in the archive. 
 
- swh.web.utils.archive.lookup_directory(sha1_git)[source]#
- Return information about the directory with id sha1_git. - Parameters:
- sha1_git – directory identifier, as a string 
- Returns:
- directory information as dict. 
 
- swh.web.utils.archive.lookup_directory_with_path(sha1_git: str, path: str) Dict[str, Any][source]#
- Return directory information for entry with specified path w.r.t. root directory pointed by sha1_git - Parameters:
- sha1_git – sha1_git corresponding to the directory to which we append paths to (hopefully) find the entry 
- path – the relative path to the entry starting from the root directory pointed by sha1_git 
 
- Returns:
- Directory entry information as dict. 
- Raises:
- swh.web.utils.exc.NotFoundExc – if the directory entry is not found 
 
- swh.web.utils.archive.lookup_release(release_sha1_git: str) Dict[str, Any][source]#
- Return information about the release with sha1 release_sha1_git. - Parameters:
- release_sha1_git – The release’s sha1 as hexadecimal 
- Returns:
- Release information as dict. 
- Raises:
- ValueError – if the identifier provided is not of sha1 nature. 
- swh.web.utils.exc.NotFoundExc – if there is no release with the provided sha1_git. 
 
 
- swh.web.utils.archive.lookup_release_multiple(sha1_git_list) Iterator[Dict[str, Any] | None][source]#
- Return information about the releases identified with their sha1_git identifiers. - Parameters:
- sha1_git_list – A list of release sha1_git identifiers 
- Returns:
- Iterator of Release metadata information as dict. 
- Raises:
- ValueError – if the identifier provided is not of sha1 nature. 
 
- swh.web.utils.archive.lookup_revision(rev_sha1_git) Dict[str, Any][source]#
- Return information about the revision with sha1 rev_sha1_git. - Parameters:
- rev_sha1_git – The revision’s sha1 as hexadecimal 
- Returns:
- Revision information as dict. 
- Raises:
- ValueError – if the identifier provided is not of sha1 nature. 
- swh.web.utils.exc.NotFoundExc – if there is no revision with the provided sha1_git. 
 
 
- swh.web.utils.archive.lookup_revision_multiple(sha1_git_list) Iterator[Dict[str, Any] | None][source]#
- Return information about the revisions identified with their sha1_git identifiers. - Parameters:
- sha1_git_list – A list of revision sha1_git identifiers 
- Yields:
- revision information as dict if the revision exists, None otherwise. 
- Raises:
- ValueError – if the identifier provided is not of sha1 nature. 
 
- swh.web.utils.archive.lookup_revision_message(rev_sha1_git) Dict[str, bytes][source]#
- Return the raw message of the revision with sha1 rev_sha1_git. - Parameters:
- rev_sha1_git – The revision’s sha1 as hexadecimal 
- Returns:
- Decoded revision message as dict {‘message’: <the_message>} 
- Raises:
- ValueError – if the identifier provided is not of sha1 nature. 
- swh.web.utils.exc.NotFoundExc – if the revision is not found, or if it has no message 
 
 
- swh.web.utils.archive.lookup_revision_by(origin_url: str, branch_name: str = 'HEAD', timestamp: int | str | None = None)[source]#
- Lookup revision by origin, snapshot branch name and visit timestamp. - If branch_name is not provided, lookup using ‘HEAD’ as default. If timestamp is not provided, use the most recent. - Parameters:
- origin_url – URL of origin to lookup revision 
- branch_name – snapshot branch name 
- timestamp – origin visit time frame 
 
- Returns:
- The revision matching the criteria 
- Raises:
- swh.web.utils.exc.NotFoundExc – if no revision corresponds to the criteria 
 
- swh.web.utils.archive.lookup_revision_log(rev_sha1_git, limit)[source]#
- Lookup revision log by revision id. 
- Returns:
- Revision log as list of revision dicts 
- Raises:
- ValueError – if the identifier provided is not of sha1 nature. 
- swh.web.utils.exc.NotFoundExc – if there is no revision with the provided sha1_git. 
 
 
- swh.web.utils.archive.lookup_revision_log_by(origin, branch_name, timestamp, limit)[source]#
- Lookup revision log by origin, snapshot branch name and visit timestamp. 
- Returns:
- Revision log as list of revision dicts 
- Raises:
- swh.web.utils.exc.NotFoundExc – if no revision corresponds to the criteria 
 
- swh.web.utils.archive.lookup_revision_with_context_by(origin, branch_name, timestamp, sha1_git, limit=100)[source]#
- Return information about revision sha1_git, limited to the sub-graph of all transitive parents of sha1_git_root, where sha1_git_root is resolved through the lookup of a revision by origin, branch_name and timestamp. - In other words, sha1_git is an ancestor of sha1_git_root. - Parameters:
- origin – origin of the revision. 
- branch_name – revision’s branch. 
- timestamp – revision’s time frame. 
- sha1_git – one of sha1_git_root’s ancestors. 
- limit – limit the lookup to this many revisions back (default 100). 
 
- Returns:
- Pair of (root_revision, revision). Information on sha1_git if it is an ancestor of sha1_git_root including children leading to sha1_git_root 
- Raises:
- BadInputExc – in case of unknown algo_hash or bad hash. 
- swh.web.utils.exc.NotFoundExc – if either revision is not found or if sha1_git is not an ancestor of sha1_git_root. 
 
 
- swh.web.utils.archive.lookup_revision_with_context(sha1_git_root: str | Dict[str, Any] | Revision, sha1_git: str, limit: int = 100) Dict[str, Any][source]#
- Return information about revision sha1_git, limited to the sub-graph of all transitive parents of sha1_git_root. - In other words, sha1_git is an ancestor of sha1_git_root. - Parameters:
- sha1_git_root – latest revision. The type is either a sha1 (as a hex string) or a non-converted dict. 
- sha1_git – one of sha1_git_root’s ancestors 
- limit – limit the lookup to 100 revisions back 
 
- Returns:
- Information on sha1_git if it is an ancestor of sha1_git_root including children leading to sha1_git_root 
- Raises:
- BadInputExc – in case of unknown algo_hash or bad hash 
- swh.web.utils.exc.NotFoundExc – if either revision is not found or if sha1_git is not an ancestor of sha1_git_root 
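The idea of looking for sha1_git among the ancestors of sha1_git_root, within a bounded number of revisions, can be sketched on a toy history. Everything here is hypothetical: `ancestor_with_children` is not the module's algorithm, revisions are plain strings, and `LookupError` stands in for the NotFoundExc mentioned above.

```python
from collections import deque
from typing import Dict, List


def ancestor_with_children(
    parents: Dict[str, List[str]], root: str, target: str, limit: int = 100
) -> Dict[str, object]:
    """Walk at most `limit` revisions back from `root`, looking for `target`.

    Returns the target id with the visited revisions that have it as a parent,
    or raises LookupError if it is not reached within the limit.
    """
    children: Dict[str, List[str]] = {}
    seen = {root}
    queue = deque([root])
    visited = 0
    found = False
    while queue and visited < limit:
        rev = queue.popleft()
        visited += 1
        if rev == target:
            found = True
        for parent in parents.get(rev, []):
            # Record the child so the path back to the root can be shown.
            children.setdefault(parent, []).append(rev)
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    if not found:
        raise LookupError(f"{target} not reached from {root} within {limit} revisions")
    return {"id": target, "children": children.get(target, [])}


# Toy history: d is the latest revision; b and c both descend from a.
history = {"d": ["b", "c"], "b": ["a"], "c": ["a"], "a": []}
info = ancestor_with_children(history, root="d", target="a")
```

With a small limit the walk stops before reaching the target, which is why a too-low limit can make a genuine ancestor look unrelated.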
 
 
- swh.web.utils.archive.lookup_directory_with_revision(sha1_git: str, dir_path: str | None = None, with_data: bool = False) Dict[str, Any][source]#
- Return information on directory pointed by revision with sha1_git. If dir_path is not provided, display top level directory. Otherwise, display the directory pointed by dir_path (if it exists). - Parameters:
- sha1_git – revision’s hash. 
- dir_path – optional directory pointed to by that revision. 
- with_data – boolean indicating whether to retrieve the raw data if the path resolves to a content. Defaults to False. 
 
- Returns:
- Information on the directory pointed to by that revision. 
- Raises:
- BadInputExc – in case of unknown algo_hash or bad hash. 
- swh.web.utils.exc.NotFoundExc – either if the revision is not found or the path referenced does not exist. 
- NotImplementedError – if dir_path exists but does not reference an entry of type 'dir' or 'file'. 
 
 
- swh.web.utils.archive.lookup_content(q: str, json_convert: bool = True, with_data: bool = False) Dict[str, Any][source]#
- Lookup the content designated by q. - Parameters:
- q – query string of the form <hash_algo:hash> 
- json_convert – whether to convert content metadata in JSON serializable format 
- with_data – whether to fetch and return content bytes 
 
- Returns:
- a dict holding content metadata and possibly its raw bytes 
- Raises:
- swh.web.utils.exc.NotFoundExc – if the requested content or its bytes are not found 
 
- swh.web.utils.archive.stat_counters()[source]#
- Return the stat counters for Software Heritage - Returns:
- A dict mapping textual labels to integer values. 
 
- swh.web.utils.archive.lookup_origin_visits(origin: str, last_visit: int | None = None, per_page: int = 10) Iterator[OriginVisitInfo][source]#
- Yields the origin’s visits. - Parameters:
- origin – origin to list visits for 
- Yields:
- Dictionaries of origin_visit for that origin 
 
- swh.web.utils.archive.lookup_origin_visit_latest(origin_url: str, require_snapshot: bool = False, type: str | None = None, allowed_statuses: List[str] | None = None, lookup_similar_urls: bool = True) OriginVisitInfo | None[source]#
- Return the origin’s latest visit - Parameters:
- origin_url – origin to list visits for 
- type – Optional visit type to filter on (e.g. git, svn, hg, npm, pypi, …) 
- allowed_statuses – list of visit statuses considered to find the latest visit. For instance, allowed_statuses=['full'] will only consider visits that have successfully run to completion.
- require_snapshot – filter out origins without a snapshot 
- lookup_similar_urls – if True, look up the origin with and without a trailing slash in its URL
 
- Returns:
- The origin visit info as dict if found 
 
- swh.web.utils.archive.lookup_origin_visit(origin_url: str, visit_id: int, lookup_similar_urls: bool = True) OriginVisitInfo[source]#
- Return information about visit visit_id of origin origin_url. - Parameters:
- origin_url – origin concerned by the visit 
- visit_id – the visit identifier to lookup 
- lookup_similar_urls – if True, look up the origin with and without a trailing slash in its URL
 
- Raises:
- swh.web.utils.exc.NotFoundExc – if no origin visit matching the criteria is found 
- Returns:
- The dict origin_visit concerned 
 
- swh.web.utils.archive.origin_visit_find_by_date(origin_url: str, visit_date: datetime, greater_or_equal: bool = True, type: str | None = None) OriginVisitInfo | None[source]#
- Retrieve origin visit status whose date is more recent than the provided visit_date. - Parameters:
- origin_url – origin concerned by the visit 
- visit_date – provided visit date 
- greater_or_equal – ensure returned visit has a date greater than or equal to the one passed as parameter 
- type – Optional visit type to filter on (e.g. git, svn, hg, npm, pypi, …) 
 
- Returns:
- The dict origin_visit_status matching the criteria if any. 
 
- swh.web.utils.archive.lookup_snapshot_sizes(snapshot_id: str, branch_name_exclude_prefix: str | None = 'refs/pull/') Dict[str, int][source]#
- Count the number of branches in the snapshot with the given id. 
- swh.web.utils.archive.lookup_snapshot(snapshot_id: str, branches_from: str = '', branches_count: int = 1000, target_types: List[str] | None = None, branch_name_include_substring: str | None = None, branch_name_exclude_prefix: str | None = 'refs/pull/') Dict[str, Any][source]#
- Return information about a snapshot, aka the list of named branches found during a specific visit of an origin. - Parameters:
- snapshot_id – sha1 identifier of the snapshot 
- branches_from – optional parameter used to skip branches whose names sort before it 
- branches_count – optional parameter used to limit the number of returned branches 
- target_types – optional parameter used to filter the target types of branch to return (possible values that can be contained in that list are ‘content’, ‘directory’, ‘revision’, ‘release’, ‘snapshot’, ‘alias’) 
- branch_name_include_substring – if provided, only return branches whose name contains given substring 
- branch_name_exclude_prefix – if provided, do not return branches whose name starts with given pattern 
 
- Raises:
- swh.web.utils.exc.NotFoundExc – if the given snapshot_id is missing 
- Returns:
- A dict filled with the snapshot content. 
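The branch-selection parameters can be illustrated on a toy snapshot. This is a sketch under assumptions, not the module's logic: `filter_branches` is hypothetical, branch names are plain strings, the mapping holds only target types, and the order in which filtering and counting are applied is a guess.

```python
from typing import Dict, List, Optional


def filter_branches(
    branches: Dict[str, str],
    branches_from: str = "",
    branches_count: int = 1000,
    target_types: Optional[List[str]] = None,
    branch_name_include_substring: Optional[str] = None,
    branch_name_exclude_prefix: Optional[str] = "refs/pull/",
) -> Dict[str, str]:
    """Hypothetical filter over {branch_name: target_type}, in name order."""
    selected: Dict[str, str] = {}
    for name in sorted(branches):
        if name < branches_from:
            continue  # skip names sorting before branches_from
        if target_types is not None and branches[name] not in target_types:
            continue
        if branch_name_include_substring and branch_name_include_substring not in name:
            continue
        if branch_name_exclude_prefix and name.startswith(branch_name_exclude_prefix):
            continue
        if len(selected) >= branches_count:
            break
        selected[name] = branches[name]
    return selected


toy = {
    "HEAD": "alias",
    "refs/heads/main": "revision",
    "refs/pull/1/head": "revision",
    "refs/tags/v1.0": "release",
}
heads_and_tags = filter_branches(
    toy, branches_from="refs/", target_types=["revision", "release"]
)
```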
 
- swh.web.utils.archive.lookup_latest_origin_snapshot(origin: str, allowed_statuses: List[str] | None = None) Dict[str, Any] | None[source]#
- Return information about the latest snapshot of an origin. - Warning: at most 1000 branches contained in the snapshot will be returned for performance reasons. - Parameters:
- origin – URL or integer identifier of the origin 
- allowed_statuses – list of visit statuses considered to find the latest snapshot for the visit. For instance, allowed_statuses=['full'] will only consider visits that have successfully run to completion.
 
- Returns:
- A dict filled with the snapshot content. 
 
- swh.web.utils.archive.lookup_snapshot_alias(snapshot_id: str, alias_name: str) Dict[str, Any] | None[source]#
- Try to resolve a branch alias in a snapshot. - Parameters:
- snapshot_id – hexadecimal representation of a snapshot id 
- alias_name – name of the branch alias to resolve 
 
- Returns:
- Target branch information or None if the alias does not exist or targets a dangling branch. 
 
- swh.web.utils.archive.lookup_revision_through(revision, limit=100)[source]#
- Retrieve a revision from the criteria stored in the revision dictionary. - Parameters:
- revision – Dictionary of criteria to look up the revision with. The supported combinations of keys are: (origin_url, branch_name, ts, sha1_git); (origin_url, branch_name, ts); (sha1_git_root, sha1_git); (sha1_git). 
 
- Returns:
- The actual revision, or None if the revision is not found. 
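The supported key combinations can be read as a most-specific-first dispatch. The sketch below only returns the name of the function each combination would plausibly map to; that mapping is inferred from the signatures documented in this module and is an assumption, and `pick_lookup` itself is a hypothetical helper.

```python
from typing import Any, Dict, Optional


def pick_lookup(revision: Dict[str, Any]) -> Optional[str]:
    """Return the name of the lookup a criteria dict would plausibly map to.

    Combinations are tested from most to least specific.
    """
    keys = set(revision)
    if {"origin_url", "branch_name", "ts", "sha1_git"} <= keys:
        return "lookup_revision_with_context_by"
    if {"origin_url", "branch_name", "ts"} <= keys:
        return "lookup_revision_by"
    if {"sha1_git_root", "sha1_git"} <= keys:
        return "lookup_revision_with_context"
    if "sha1_git" in keys:
        return "lookup_revision"
    return None


choice = pick_lookup({"sha1_git_root": "r" * 40, "sha1_git": "s" * 40})
```

Testing the larger key sets first matters, because every combination except the second also contains sha1_git.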
 
- swh.web.utils.archive.lookup_directory_through_revision(revision, path=None, limit=100, with_data=False)[source]#
- Retrieve the directory information from the revision. - Parameters:
- revision – dictionary of criteria representing a revision to lookup 
- path – directory’s path to lookup. 
- limit – optional query parameter to limit the revisions log (default to 100). For now, note that this limit could impede the transitivity conclusion about sha1_git not being an ancestor of. 
- with_data – indicate to retrieve the content’s raw data if path resolves to a content. 
 
- Returns:
- The directory pointed to by the revision criteria at path. 
 
- swh.web.utils.archive.vault_cook(bundle_type: str, swhid: CoreSWHID, email=None)[source]#
- Cook a vault bundle. 
- swh.web.utils.archive.vault_download(bundle_type: str, swhid: CoreSWHID)[source]#
- Fetch a vault bundle. 
- swh.web.utils.archive.vault_download_url(bundle_type: str, swhid: CoreSWHID, filename: str) str | None[source]#
- Get optional direct download URL for a cooked vault bundle. 
- swh.web.utils.archive.vault_progress(bundle_type: str, swhid: CoreSWHID)[source]#
- Get the current progress of a vault bundle. 
- swh.web.utils.archive.diff_revision(rev_id)[source]#
- Get the list of file changes (insertion / deletion / modification / renaming) for a particular revision. 
- swh.web.utils.archive.get_revisions_walker(rev_walker_type, rev_start, *args, **kwargs)[source]#
- Utility function to instantiate a revisions walker of a given type, see swh.storage.algos.revisions_walker. - Parameters:
- rev_walker_type (str) – the type of revisions walker to return, possible values are: committer_date, dfs, dfs_post, bfs and path
- rev_start (str) – hexadecimal representation of a revision identifier 
- args (list) – positional arguments to pass to the revisions walker constructor 
- kwargs (dict) – keyword arguments to pass to the revisions walker constructor 
 
 
- swh.web.utils.archive.lookup_object(object_type: ObjectType, object_id: str) Dict[str, Any][source]#
- Utility function for looking up an object in the archive by its type and id. 
- Returns:
- A dictionary describing the object or a list of dictionaries for the directory object type. 
- Return type:
- Dict[str, Any] 
- Raises:
- swh.web.utils.exc.NotFoundExc – if the object could not be found in the archive 
- BadInputExc – if the object identifier is invalid 
 
 
- swh.web.utils.archive.lookup_missing_hashes(grouped_swhids: Dict[ObjectType, List[bytes]]) Set[str][source]#
- Lookup missing SoftWare Hash IDentifiers using batch processing. - Parameters:
- grouped_swhids – A dictionary with object types as keys and lists of object hashes as values 
 
- Returns:
- A set(hexadecimal) of the hashes not found in the storage 
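The shape of the grouped input and of the returned set can be sketched with a toy in-memory archive. `ToyObjectType`, `missing_hashes` and the `archived` mapping are hypothetical stand-ins; the real function takes the ObjectType enum shown in the signature and queries storage in batches.

```python
from enum import Enum
from typing import Dict, List, Set


class ToyObjectType(Enum):
    """Stand-in for the ObjectType enum used in the signature."""

    CONTENT = "cnt"
    DIRECTORY = "dir"


def missing_hashes(
    grouped: Dict[ToyObjectType, List[bytes]],
    archived: Dict[ToyObjectType, Set[bytes]],
) -> Set[str]:
    """Return the hex form of every hash absent from the toy archive."""
    missing: Set[str] = set()
    for object_type, hashes in grouped.items():
        present = archived.get(object_type, set())
        missing.update(h.hex() for h in hashes if h not in present)
    return missing


known = bytes.fromhex("aa" * 20)
unknown = bytes.fromhex("bb" * 20)
archived = {ToyObjectType.CONTENT: {known}}
grouped = {
    ToyObjectType.CONTENT: [known, unknown],
    ToyObjectType.DIRECTORY: [known],
}
missing = missing_hashes(grouped, archived)
```

Note that the same hash can be present for one object type and missing for another, which is why the input is grouped by type.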
 
- swh.web.utils.archive.lookup_origins_by_sha1s(sha1s: List[str]) Iterator[OriginInfo | None][source]#
- Lookup origins from the sha1 hash values of their URLs. - Parameters:
- sha1s – list of sha1 hexadecimal representations 
- Yields:
- origin information as dict 
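A caller needs the sha1 of each origin URL before using this lookup. The sketch below computes such digests and resolves them against a toy index. The UTF-8 encoding of the URL is an assumption, and `url_sha1`, `TOY_INDEX` and `toy_lookup_origins_by_sha1s` are hypothetical names used only here.

```python
import hashlib
from typing import Dict, Iterator, List, Optional


def url_sha1(url: str) -> str:
    """Hex sha1 of the URL bytes (UTF-8 encoding assumed)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


# Toy index from sha1(URL) to a minimal origin dict.
TOY_INDEX: Dict[str, Dict[str, str]] = {
    url_sha1(u): {"url": u}
    for u in ("https://example.org/a", "https://example.org/b")
}


def toy_lookup_origins_by_sha1s(
    sha1s: List[str],
) -> Iterator[Optional[Dict[str, str]]]:
    """Yield the origin dict for each digest, or None when it is unknown."""
    for sha1 in sha1s:
        yield TOY_INDEX.get(sha1)


found = list(
    toy_lookup_origins_by_sha1s([url_sha1("https://example.org/a"), "0" * 40])
)
```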
 
- swh.web.utils.archive.lookup_extid(extid_type: str, extid_format: str, extid: str, extid_version: int | None = None) Dict[str, Any][source]#
- Lookup an ExtID by its type and value. - Parameters:
- extid_type – the type of the ExtID 
- extid_format – the format used to encode the extid in an ASCII string, either base64url, hex or raw.
- extid – the value of the ExtID 
- extid_version – the version of the ExtID 
 
- Returns:
- ExtID information as a dict 
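The three extid_format values describe how the ExtID bytes are written as an ASCII string. A minimal sketch of decoding each format follows. `decode_extid` is a hypothetical helper, and the reading of raw as "the ASCII string is the byte value itself" is an assumption.

```python
import base64


def decode_extid(extid_format: str, extid: str) -> bytes:
    """Hypothetical helper: turn an ASCII-encoded ExtID back into bytes."""
    if extid_format == "hex":
        return bytes.fromhex(extid)
    if extid_format == "base64url":
        # Restore any '=' padding stripped from the URL-safe encoding.
        padded = extid + "=" * (-len(extid) % 4)
        return base64.urlsafe_b64decode(padded)
    if extid_format == "raw":
        return extid.encode("ascii")
    raise ValueError(f"unsupported extid_format: {extid_format!r}")


value = b"\x01\xfe"
from_hex = decode_extid("hex", value.hex())
from_b64 = decode_extid(
    "base64url", base64.urlsafe_b64encode(value).decode("ascii").rstrip("=")
)
from_raw = decode_extid("raw", "abc123")
```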
 
- swh.web.utils.archive.lookup_extid_by_target(swhid: str, extid_type: str | None = None, extid_version: int | None = None, extid_format: str = 'hex') List[Dict[str, Any]][source]#
- Lookup ExtIDs targeting an archived object. - Parameters:
- extid_type – the type of the ExtID 
- extid_format – the format to use for encoding an extid to an ASCII string, either base64url, hex or raw.
- extid – the value of the ExtID 
- extid_version – the version of the ExtID 
 
- Returns:
- ExtIDs information as a list of dict 
 
- swh.web.utils.archive.lookup_raw_extrinsic_metadata(target_swhid: ExtendedSWHID, authority: MetadataAuthority, after: datetime | None = None, page_token: str | None = None, limit: int = 100) PagedResult[Dict[str, Any], str][source]#