swh.storage.postgresql.db module
- swh.storage.postgresql.db.execute_values_generator(cur: Cursor, query: str, values: Iterable[Any]) → Iterator[Any]
- class swh.storage.postgresql.db.QueryBuilder
- Bases: object
- add_pagination_clause(pagination_key: List[str], cursor: Any | None, direction: ListOrder | None, limit: int | None, separator: str = 'AND') → None
- Create and add a pagination clause to the query.
- Parameters:
- pagination_key – Pagination key to be used. Use a list of strings to support alias fields 
- cursor – Pagination cursor as a query parameter 
- direction – Sort order 
- limit – Limit as a query parameter 
- separator – Separator to be used as the prefix for the clause 
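A minimal sketch of how such a clause could be assembled, for illustration only. It is not the QueryBuilder implementation: the `Order` enum stands in for `ListOrder`, the function name is hypothetical, boolean flags replace the actual cursor and limit values, and joining the key parts with a dot is an assumption about how alias fields are rendered.

```python
from enum import Enum
from typing import List, Optional


class Order(Enum):
    """Hypothetical stand-in for the ListOrder type named in the signature."""

    ASC = "asc"
    DESC = "desc"


def pagination_clause(
    pagination_key: List[str],
    has_cursor: bool,
    direction: Optional[Order],
    has_limit: bool,
    separator: str = "AND",
) -> str:
    """Return a keyset-pagination SQL fragment using %s placeholders."""
    # Assumption: ["ovs", "date"] designates the aliased column ovs.date.
    key = ".".join(pagination_key)
    descending = direction is Order.DESC
    parts = []
    if has_cursor:
        # Resume strictly after the last row already returned.
        operator = "<" if descending else ">"
        parts.append(f"{separator} {key} {operator} %s")
    parts.append(f"ORDER BY {key} {'DESC' if descending else 'ASC'}")
    if has_limit:
        parts.append("LIMIT %s")
    return " ".join(parts)


print(pagination_clause(["ovs", "date"], True, Order.DESC, True))
```

The cursor and limit stay as query parameters (`%s`), so their values are passed to the database driver separately rather than interpolated into the SQL text.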
 
 
 
- class swh.storage.postgresql.db.Db(conn: Connection[Any], pool: ConnectionPool | None = None)
- Bases: BaseDb
- Proxy to the SWH DB, with wrappers around stored procedures.
- Create a DB proxy.
- Parameters:
- conn – psycopg connection to the SWH DB 
- pool – psycopg pool of connections 
 
 - content_get_metadata_keys = ['sha1', 'sha1_git', 'sha256', 'blake2s256', 'length', 'status']#
 - content_add_keys = ['sha1', 'sha1_git', 'sha256', 'blake2s256', 'length', 'status', 'ctime']#
 - skipped_content_keys = ['sha1', 'sha1_git', 'sha256', 'blake2s256', 'length', 'reason', 'status', 'origin']#
 - content_get_range(start, end, limit=None, cur=None) → Iterator[Tuple]
- Retrieve contents within range [start, end]. 
 - content_hash_keys = ['sha1', 'sha1_git', 'sha256', 'blake2s256']#
 - content_find_cols = ['sha1', 'sha1_git', 'sha256', 'blake2s256', 'length', 'ctime', 'status']#
 - content_find(sha1: bytes | None = None, sha1_git: bytes | None = None, sha256: bytes | None = None, blake2s256: bytes | None = None, cur=None)
- Find content by any combination of the checksums sha1, sha1_git, sha256, and blake2s256; each one is optional.
- Parameters:
- sha1 – sha1 of the content 
- sha1_git – sha1 of the content, computed the way git does 
- sha256 – sha256 of the content 
- blake2s256 – blake2s256 of the content 
 
- Returns:
- The tuple (sha1, sha1_git, sha256, blake2s256) if found or None. 
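A sketch of how a lookup on optional checksums can be turned into a WHERE clause, assuming only the supplied checksums are combined with AND. The helper name is hypothetical, and raising an error when no checksum is given is an assumption; the behaviour of content_find in that case is not documented here.

```python
from typing import List, Optional, Tuple

# Same names as content_hash_keys above.
HASH_KEYS = ["sha1", "sha1_git", "sha256", "blake2s256"]


def content_where_clause(**checksums: Optional[bytes]) -> Tuple[str, List[bytes]]:
    """Build a parameterized WHERE clause from the checksums that are set."""
    unknown = set(checksums) - set(HASH_KEYS)
    if unknown:
        raise ValueError(f"unknown checksum(s): {sorted(unknown)}")
    # Keep a stable column order and skip checksums left as None.
    given = [(key, checksums[key]) for key in HASH_KEYS if checksums.get(key) is not None]
    if not given:
        # Assumption: an empty filter is treated as a caller error.
        raise ValueError("at least one checksum is required")
    where = " AND ".join(f"{key} = %s" for key, _ in given)
    params = [value for _, value in given]
    return where, params


where, params = content_where_clause(sha1=b"\x01", sha256=b"\x02", sha1_git=None)
print(where, params)
```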
 
 - skipped_content_find_cols = ['sha1', 'sha1_git', 'sha256', 'blake2s256', 'length', 'status', 'reason', 'ctime']#
 - skipped_content_find(sha1: bytes | None = None, sha1_git: bytes | None = None, sha256: bytes | None = None, blake2s256: bytes | None = None, cur=None) → List[Tuple[Any]]
 - directory_ls_cols = ['dir_id', 'type', 'target', 'name', 'perms', 'status', 'sha1', 'sha1_git', 'sha256', 'length']#
 - directory_entry_get_by_path(directory, paths, cur=None)[source]#
- Retrieve a directory entry by path. 
 - directory_get_entries_cols = ['type', 'target', 'name', 'perms']#
 - directory_get_raw_manifest(directory_ids: List[bytes], cur=None) → Iterable[Tuple[bytes, bytes]]
 - revision_add_cols = ['id', 'date', 'date_offset', 'date_neg_utc_offset', 'date_offset_bytes', 'committer_date', 'committer_date_offset', 'committer_date_neg_utc_offset', 'committer_date_offset_bytes', 'type', 'directory', 'message', 'author_fullname', 'author_name', 'author_email', 'committer_fullname', 'committer_name', 'committer_email', 'metadata', 'synthetic', 'extra_headers', 'raw_manifest']#
 - revision_get_cols = ['id', 'date', 'date_offset', 'date_neg_utc_offset', 'date_offset_bytes', 'committer_date', 'committer_date_offset', 'committer_date_neg_utc_offset', 'committer_date_offset_bytes', 'type', 'directory', 'message', 'author_fullname', 'author_name', 'author_email', 'committer_fullname', 'committer_name', 'committer_email', 'metadata', 'synthetic', 'extra_headers', 'raw_manifest', 'parents']#
 - revision_shortlog_cols = ['id', 'parents']#
 - extid_cols = ['extid', 'extid_version', 'extid_type', 'target', 'target_type']#
 - extid_get_from_extid_list(extid_type: str, ids: List[bytes], version: int | None = None, cur=None)[source]#
 - extid_get_from_swhid_list(target_type: str, ids: List[bytes], extid_version: int | None = None, extid_type: str | None = None, cur=None)[source]#
 - release_add_cols = ['id', 'target', 'target_type', 'date', 'date_offset', 'date_neg_utc_offset', 'date_offset_bytes', 'name', 'comment', 'synthetic', 'raw_manifest', 'author_fullname', 'author_name', 'author_email']#
 - release_get_cols = ['id', 'target', 'target_type', 'date', 'date_offset', 'date_neg_utc_offset', 'date_offset_bytes', 'name', 'comment', 'synthetic', 'raw_manifest', 'author_fullname', 'author_name', 'author_email']#
 - snapshot_count_cols = ['target_type', 'count']#
 - snapshot_get_cols = ['snapshot_id', 'name', 'target', 'target_type']#
 - snapshot_get_by_id(snapshot_id, branches_from=b'', branches_count=None, target_types=None, branch_name_include_substring=None, branch_name_exclude_prefix=None, cur=None)[source]#
 - origin_visit_add(origin, ts, type, cur=None)[source]#
- Add a new origin_visit for origin origin at timestamp ts. - Parameters:
- origin – origin concerned by the visit 
- ts – the date of the visit 
- type – type of loader for the visit 
 
- Returns:
- The new visit index step for that origin 
 
 - origin_visit_status_cols = ['origin', 'visit', 'date', 'type', 'status', 'snapshot', 'metadata']#
 - origin_visit_status_add(visit_status: OriginVisitStatus, cur=None) → None
- Add a new origin visit status. 
 - origin_visit_cols = ['origin', 'visit', 'date', 'type']
 - origin_visit_add_with_id(origin_visit: OriginVisit, cur=None) → None
- Insert an origin visit whose id is already set. 
 - origin_visit_get_cols = ['origin', 'visit', 'date', 'type', 'status', 'metadata', 'snapshot']#
 - origin_visit_select_cols = ['o.url AS origin', 'ov.visit', 'ov.date', 'ov.type AS type', 'ovs.status', 'ovs.snapshot', 'ovs.metadata']#
 - origin_visit_status_select_cols = ['o.url AS origin', 'ovs.visit', 'ovs.date', 'ovs.type AS type', 'ovs.status', 'ovs.snapshot', 'ovs.metadata']#
 - origin_visit_status_get_latest(origin_url: str, visit: int, allowed_statuses: List[str] | None = None, require_snapshot: bool = False, cur=None) → Dict[str, Any] | None
- Given an origin visit id, return its latest origin_visit_status. 
 - origin_visit_status_get_range(origin: str, visit: int, date_from: datetime | None, order: ListOrder, limit: int, cur=None)[source]#
- Retrieve visit_status rows for visit (origin, visit) in a paginated way. 
 - origin_visit_get_range(origin: str, visit_from: int, order: ListOrder, limit: int, cur=None)[source]#
 - origin_visit_status_get_all_in_range(origin: str, allowed_statuses: List[str] | None, require_snapshot: bool, visit_from: int, visit_to: int, cur=None)[source]#
 - origin_visit_get(origin_id, visit_id, cur=None)[source]#
- Retrieve information on visit visit_id of origin origin_id. - Parameters:
- origin_id – the origin concerned 
- visit_id – The visit step for that origin 
 
- Returns:
- The origin_visit information 
 
 - origin_visit_exists(origin_id, visit_id, cur=None)[source]#
- Check whether an origin visit with the given ids exists 
 - origin_visit_get_latest(origin_id: str, type: str | None, allowed_statuses: Iterable[str] | None, require_snapshot: bool, cur=None)[source]#
- Retrieve the most recent origin_visit of the given origin, with optional filters. - Parameters:
- origin_id – the origin concerned 
- type – Optional visit type to filter on 
- allowed_statuses – the visit statuses allowed for the returned visit 
- require_snapshot (bool) – If True, only a visit with a known snapshot will be returned. 
 
- Returns:
- The origin_visit information, or None if no visit matches. 
 
 - origin_visit_get_random(type, cur=None)[source]#
- Randomly select one origin visit that was full and made within the last 3 months. 
 - origin_cols = ['url']#
 - origin_get_range_cols = ['id', 'url']#
 - origin_get_range(origin_from: int = 1, origin_count: int = 100, cur=None)
- Retrieve origin_count origins whose ids are greater than or equal to origin_from.
- Origins are sorted by id before retrieving them.
- Parameters:
- origin_from – the minimum id of origins to retrieve 
- origin_count – the maximum number of origins to retrieve 
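The selection rule can be illustrated in memory, assuming rows shaped like origin_get_range_cols, that is (id, url) pairs. The function name and sample data are hypothetical; the real method runs against the database.

```python
from typing import Iterable, List, Tuple

Origin = Tuple[int, str]  # (id, url), matching origin_get_range_cols


def origin_range(
    origins: Iterable[Origin], origin_from: int = 1, origin_count: int = 100
) -> List[Origin]:
    """Keep origins with id >= origin_from, sort by id, return at most origin_count."""
    selected = sorted(
        (origin for origin in origins if origin[0] >= origin_from),
        key=lambda origin: origin[0],
    )
    return selected[:origin_count]


rows = [(3, "https://c.example"), (1, "https://a.example"),
        (2, "https://b.example"), (5, "https://e.example")]
page = origin_range(rows, origin_from=2, origin_count=2)
print(page)
```

Under this sketch, a caller walking through all origins would request the next page with origin_from set to the last returned id plus one.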
 
 
 - origin_search(url_pattern: str, offset: int = 0, limit: int = 50, regexp: bool = False, with_visit: bool = False, visit_types: List[str] | None = None, cur=None)[source]#
- Search for origins whose URLs contain a provided string pattern or match a provided regular expression. The search is case-insensitive.
- Parameters:
- url_pattern – the string pattern to search for in origin URLs 
- offset – number of found origins to skip before returning results 
- limit – the maximum number of found origins to return 
- regexp – if True, treat the provided pattern as a regular expression and return origins whose URLs match it 
- with_visit – if True, filter out origins with no visit 
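The matching and paging semantics described above can be approximated in plain Python. This is a sketch under assumptions: the function name is hypothetical, Python's `re` syntax is not identical to PostgreSQL's regular expressions, the result order here is simply the input order, and the with_visit and visit_types filters are left out.

```python
import re
from typing import Iterable, List


def search_urls(
    urls: Iterable[str],
    url_pattern: str,
    offset: int = 0,
    limit: int = 50,
    regexp: bool = False,
) -> List[str]:
    """Case-insensitive substring or regular-expression search with offset/limit."""
    if regexp:
        matcher = re.compile(url_pattern, re.IGNORECASE)
        matches = [url for url in urls if matcher.search(url)]
    else:
        needle = url_pattern.lower()
        matches = [url for url in urls if needle in url.lower()]
    # Skip the first `offset` matches, then return at most `limit` of them.
    return matches[offset:offset + limit]


urls = [
    "https://GitHub.com/a/b",
    "https://gitlab.com/c",
    "https://example.org/github-mirror",
]
print(search_urls(urls, "github"))
print(search_urls(urls, r"^https://git(hub|lab)\.com/", regexp=True))
```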
 
 
 - origin_count(url_pattern, regexp=False, with_visit=False, cur=None)[source]#
- Count origins whose URLs contain a provided string pattern or match a provided regular expression. The search is case-insensitive. 
 - object_find_by_sha1_git_cols = ['sha1_git', 'type']#
 - raw_extrinsic_metadata_get_cols = ['raw_extrinsic_metadata.target', 'raw_extrinsic_metadata.type', 'discovery_date', 'metadata_authority.type', 'metadata_authority.url', 'metadata_fetcher.id', 'metadata_fetcher.name', 'metadata_fetcher.version', 'origin', 'visit', 'snapshot', 'release', 'revision', 'path', 'directory', 'format', 'raw_extrinsic_metadata.metadata']#
- List of columns of the raw_extrinsic_metadata, metadata_authority, and metadata_fetcher tables, used when reading object metadata. 
 - raw_extrinsic_metadata_add(id: bytes, type: str, target: str, discovery_date: datetime, authority_id: int, fetcher_id: int, format: str, metadata: bytes, origin: str | None, visit: int | None, snapshot: str | None, release: str | None, revision: str | None, path: bytes | None, directory: str | None, cur)[source]#
 - raw_extrinsic_metadata_get(target: str, authority_id: int, after_time: datetime | None, after_fetcher: int | None, limit: int, cur)[source]#
 - metadata_fetcher_cols = ['name', 'version']#
 - metadata_authority_cols = ['type', 'url']#
 - object_references_create_partition(year: int, week: int, cur=None) → Tuple[date, date]
- Create the partition of the object_references table for the given ISO year and week.
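For reference, an ISO year and week map to calendar dates as sketched below. The helper name is hypothetical, and treating the returned pair as the Monday that starts the week and the Monday that starts the next one (an exclusive upper bound) is an assumption; the exact bounds returned by object_references_create_partition are not documented here.

```python
from datetime import date, timedelta
from typing import Tuple


def iso_week_bounds(year: int, week: int) -> Tuple[date, date]:
    """Return (first day of the ISO week, first day of the following week)."""
    # ISO weeks start on Monday, which is weekday 1 in the ISO calendar.
    start = date.fromisocalendar(year, week, 1)
    end = start + timedelta(days=7)
    return start, end


print(iso_week_bounds(2024, 1))
```

Note that ISO week 1 does not always start on 1 January: for 2021 it starts on 4 January, so partition boundaries follow ISO weeks rather than calendar years.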
 - object_references_drop_partition(year: int, week: int, cur=None) → None
- Delete the partition of the object_references table for the given ISO year and week.
 - object_references_list_partitions(cur=None) → List[ObjectReferencesPartition]
- List existing partitions of the object_references table, ordered from the oldest to the most recent.