swh.model.discovery module
Primitives for finding unknown content efficiently.
- class swh.model.discovery.Sample(contents, skipped_contents, directories)
- Bases: tuple
- Create new instance of Sample(contents, skipped_contents, directories)
 - contents
- Alias for field number 0
 - directories
- Alias for field number 2
 - skipped_contents
- Alias for field number 1
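
The field numbering can be checked with a stand-in tuple. This is a minimal sketch that rebuilds a Sample-like named tuple locally with collections.namedtuple instead of importing the real class; the byte strings are made-up placeholder identifiers.

```python
from collections import namedtuple

# Local stand-in with the same field order as swh.model.discovery.Sample.
Sample = namedtuple("Sample", ["contents", "skipped_contents", "directories"])

sample = Sample(
    contents={b"c1", b"c2"},  # placeholder identifiers, not real hashes
    skipped_contents=set(),
    directories={b"d1"},
)

# Each field is reachable by name or by its field number.
assert sample.contents is sample[0]
assert sample.skipped_contents is sample[1]
assert sample.directories is sample[2]

# Being a tuple, a sample unpacks in field order.
contents, skipped_contents, directories = sample
```

Note that the attributes are listed alphabetically in this page, while the positional order is the one given in the constructor signature.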
 
- class swh.model.discovery.ArchiveDiscoveryInterface(contents: List[Content], skipped_contents: List[SkippedContent], directories: List[Directory])
- Bases: Protocol
- Interface used in discovery code to abstract over ways of connecting to the SWH archive (direct storage, web API, etc.) for all methods needed by discovery algorithms.
 - skipped_contents: List[SkippedContent]
 - content_missing(contents: List[bytes]) → Iterable[bytes]
- List the contents missing from the archive, identified by sha1.
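
Because the interface is a Protocol, any object with matching methods can stand in for the archive. The sketch below illustrates that idea with a reduced, hypothetical protocol (ContentLookup) that carries only content_missing, and a toy in-memory backend (InMemoryArchive); neither name is part of swh.model, and the real interface has more members than shown here.

```python
from typing import Iterable, List, Protocol, Set


class ContentLookup(Protocol):
    """Hypothetical, reduced stand-in for the interface: one method only."""

    def content_missing(self, contents: List[bytes]) -> Iterable[bytes]:
        ...


class InMemoryArchive:
    """Toy backend that 'knows' a fixed set of sha1 identifiers."""

    def __init__(self, known_sha1s: Set[bytes]) -> None:
        self.known_sha1s = known_sha1s

    def content_missing(self, contents: List[bytes]) -> Iterable[bytes]:
        # Return the identifiers the backend has never seen, in input order.
        return [sha1 for sha1 in contents if sha1 not in self.known_sha1s]


def count_missing(archive: ContentLookup, sha1s: List[bytes]) -> int:
    # Depends only on the protocol, not on a concrete backend.
    return len(list(archive.content_missing(sha1s)))


archive = InMemoryArchive(known_sha1s={b"aa", b"bb"})
missing = list(archive.content_missing([b"aa", b"cc", b"bb", b"dd"]))
```

InMemoryArchive never inherits from ContentLookup; structural typing is what lets a direct-storage client and a web API client be swapped without changing the discovery code.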
 
- class swh.model.discovery.BaseDiscoveryGraph(contents, skipped_contents, directories, update_info_callback: Callable[[Any, bool], None] | None = None)
- Bases: object
- Creates the base structures and methods needed for discovery algorithms. Subclasses should override get_sample to affect how the discovery is made.
- The update_info_callback is an optional argument that is called for each new piece of information obtained. The callback arguments are (content, known), where content is the relevant model.Content object and known is a boolean, True if the file is known to the archive and False otherwise.
 - mark_known(entries: Iterable[bytes])
- Mark entries and those they imply as known in the SWH archive.
 - mark_unknown(entries: Iterable[bytes])
- Mark entries and those they imply as unknown in the SWH archive.
 - get_sample() → Sample
- Return a three-tuple of samples from the undecided sets of contents, skipped contents and directories respectively. These samples will be queried against the storage, which will tell us which are known.
 - do_query(archive: ArchiveDiscoveryInterface, sample: Sample) → None
- Given a three-tuple of samples, ask the archive which are known or unknown and mark them as such.
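
The sample, query and mark cycle described by these methods can be sketched on a toy tree. The code below is not the library's implementation. It assumes one plausible reading of "those they imply": a known directory implies that everything beneath it is known, and an unknown entry implies that every directory above it is unknown. Entry names, the archive_known set, and the fixed sample size of two are all made up for illustration.

```python
from typing import Dict, Set

# Toy tree: each directory maps to its direct children (dirs or files).
children: Dict[str, Set[str]] = {
    "root": {"src", "docs"},
    "src": {"a.c", "b.c"},
    "docs": {"readme"},
}
parents: Dict[str, Set[str]] = {}
for parent, kids in children.items():
    for kid in kids:
        parents.setdefault(kid, set()).add(parent)

# What the pretend archive already holds.
archive_known = {"docs", "readme", "a.c"}

undecided: Set[str] = set(children) | set(parents)
known: Set[str] = set()
unknown: Set[str] = set()


def mark_known(entry: str) -> None:
    # Assumed implication: a known directory implies known descendants.
    stack = [entry]
    while stack:
        node = stack.pop()
        if node in undecided:
            undecided.discard(node)
            known.add(node)
            stack.extend(children.get(node, ()))


def mark_unknown(entry: str) -> None:
    # Assumed implication: an unknown entry implies unknown ancestors.
    stack = [entry]
    while stack:
        node = stack.pop()
        if node in undecided:
            undecided.discard(node)
            unknown.add(node)
            stack.extend(parents.get(node, ()))


rounds = 0
while undecided:
    # Deterministic stand-in for get_sample: first two undecided names.
    sample = sorted(undecided)[:2]
    rounds += 1
    # Stand-in for do_query: ask the pretend archive, then propagate.
    for entry in sample:
        if entry in archive_known:
            mark_known(entry)
        else:
            mark_unknown(entry)
```

In this toy run, six entries are settled in two rounds of two lookups each, because every answer also decides the entries it implies.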
 
- class swh.model.discovery.RandomDirSamplingDiscoveryGraph(contents, skipped_contents, directories, update_info_callback: Callable[[Any, bool], None] | None = None)
- Bases: BaseDiscoveryGraph
- Uses random sampling over directories only.
- This allows us to find a statistically good spread of entries in the graph with a smaller population than using all types of entries. Once no directories remain, the only undecided entries, if any are left, are contents or skipped contents: we send them directly to the storage since they should be few and their structure flat.
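
The sampling rule described above can be sketched as a standalone function. This is an illustration under assumptions, not the class's actual get_sample: the function name pick_sample, the sample_size parameter, and the explicit rng argument are hypothetical, and Sample is redefined locally so the snippet runs without swh.model installed.

```python
import random
from typing import NamedTuple, Set


class Sample(NamedTuple):
    # Local stand-in mirroring the documented field order.
    contents: Set[bytes]
    skipped_contents: Set[bytes]
    directories: Set[bytes]


def pick_sample(
    undecided_contents: Set[bytes],
    undecided_skipped: Set[bytes],
    undecided_directories: Set[bytes],
    sample_size: int,  # hypothetical knob, not a documented parameter
    rng: random.Random,
) -> Sample:
    if undecided_directories:
        # While directories remain, sample among them only.
        if len(undecided_directories) <= sample_size:
            picked = set(undecided_directories)
        else:
            picked = set(rng.sample(sorted(undecided_directories), sample_size))
        return Sample(set(), set(), picked)
    # No directories left: send the remaining flat entries in one go.
    return Sample(set(undecided_contents), set(undecided_skipped), set())


rng = random.Random(0)
dirs = {bytes([i]) for i in range(10)}
first = pick_sample({b"c"}, {b"s"}, dirs, sample_size=3, rng=rng)
last = pick_sample({b"c"}, {b"s"}, set(), sample_size=3, rng=rng)
```

Sorting before sampling is only there because random.sample needs a sequence; it also makes the draw reproducible for a seeded generator.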
- swh.model.discovery.filter_known_objects(archive: ArchiveDiscoveryInterface, update_info_callback: Callable[[Any, bool], None] | None = None)
- Filter archive's contents, skipped_contents and directories to only return those that are unknown to the SWH archive, using a discovery algorithm.
- The update_info_callback is an optional argument that is called for each new piece of information obtained. The callback arguments are (content, known), where content is the relevant model.Content object and known is a boolean, True if the file is known to the archive and False otherwise.
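
A callback matching the documented (content, known) signature might look like the following sketch. The fake_discovery driver is a hypothetical stand-in that only replays a fixed list of results, so the snippet runs without an archive connection; plain strings take the place of model.Content objects.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Callback = Callable[[Any, bool], None]


def make_recorder() -> Tuple[Callback, Dict[str, List[Any]]]:
    """Build a callback that files each content under 'known' or 'unknown'."""
    seen: Dict[str, List[Any]] = {"known": [], "unknown": []}

    def update_info(content: Any, known: bool) -> None:
        seen["known" if known else "unknown"].append(content)

    return update_info, seen


def fake_discovery(update_info_callback: Optional[Callback] = None) -> None:
    # Hypothetical driver: replays canned results instead of querying SWH.
    results = [("README", True), ("main.c", False), ("LICENSE", True)]
    for content, known in results:
        if update_info_callback is not None:
            update_info_callback(content, known)


callback, seen = make_recorder()
fake_discovery(callback)
fake_discovery()  # the callback is optional, so omitting it is fine
```

A callback like this is a natural place to drive a progress display or collect statistics while discovery is still running.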