swh.provenance.luigi module
Luigi tasks to help compute the provenance of content blobs
This module contains Luigi tasks driving the computation of the Provenance index.
- class swh.provenance.luigi.ListProvenanceNodes(*args, **kwargs)
- Bases: Task
- Lists all nodes reachable from releases and ‘head revisions’.
 - local_export_path = <luigi.parameter.PathParameter object>
 - local_graph_path = <luigi.parameter.PathParameter object>
 - graph_name = <luigi.parameter.Parameter object>
 - provenance_dir = <luigi.parameter.PathParameter object>
 - provenance_node_filter = <luigi.parameter.Parameter object>
 
- class swh.provenance.luigi.ComputeEarliestTimestamps(*args, **kwargs)
- Bases: Task
- Creates an array storing, for each directory/content SWHID, the author date of the first revision/release that contains it.
 - local_export_path = <luigi.parameter.PathParameter object>
 - local_graph_path = <luigi.parameter.PathParameter object>
 - graph_name = <luigi.parameter.Parameter object>
 - provenance_dir = <luigi.parameter.PathParameter object>
 - provenance_node_filter = <luigi.parameter.Parameter object>
 - property resources
- Returns the value of self.max_ram_mb.
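The idea behind this task can be illustrated on a toy in-memory graph. This is only a sketch under stated assumptions, not the task's actual implementation: the `CHILDREN` and `AUTHOR_DATE` dictionaries and the `earliest_timestamps` helper are hypothetical, and the real task works on the compressed graph rather than on Python dictionaries.

```python
from collections import deque
from typing import Dict, List

# Hypothetical toy graph: each node maps to the nodes it directly contains.
CHILDREN: Dict[str, List[str]] = {
    "rev1": ["dirA"],
    "rev2": ["dirA", "dirB"],
    "dirA": ["file1"],
    "dirB": ["file1", "file2"],
}
# Hypothetical author dates (Unix timestamps) of the revisions.
AUTHOR_DATE: Dict[str, int] = {"rev1": 100, "rev2": 200}


def earliest_timestamps(
    children: Dict[str, List[str]], author_date: Dict[str, int]
) -> Dict[str, int]:
    """Map each directory/content to the oldest author date of a revision containing it."""
    earliest: Dict[str, int] = {}
    for rev, date in author_date.items():
        # Breadth-first traversal of everything reachable from this revision.
        queue = deque(children.get(rev, []))
        seen = set()
        while queue:
            node = queue.popleft()
            if node in seen:
                continue
            seen.add(node)
            # Keep the smallest date seen so far for this node.
            if node not in earliest or date < earliest[node]:
                earliest[node] = date
            queue.extend(children.get(node, []))
    return earliest


result = earliest_timestamps(CHILDREN, AUTHOR_DATE)
print(result)
```

Here `file1` is contained in both revisions, so it gets the older date (100), while `dirB` and `file2` only appear in `rev2` and get 200.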
 
- class swh.provenance.luigi.ListDirectoryMaxLeafTimestamp(*args, **kwargs)
- Bases: Task
- Creates a file that contains all directory/content SWHIDs, along with the first revision/release author date and SWHIDs they occur in.
 - local_export_path = <luigi.parameter.PathParameter object>
 - local_graph_path = <luigi.parameter.PathParameter object>
 - graph_name = <luigi.parameter.Parameter object>
 - provenance_dir = <luigi.parameter.PathParameter object>
 - provenance_node_filter = <luigi.parameter.Parameter object>
 - property resources
- Returns the value of self.max_ram_mb.
 - requires() Dict[str, Task]
- Returns LocalGraph and ComputeEarliestTimestamps instances.
 
- class swh.provenance.luigi.ComputeDirectoryFrontier(*args, **kwargs)
- Bases: Task
- Creates a file that contains the “directory frontier” as defined by swh-provenance.
- In short, a frontier directory is a directory that directly contains a file (not a directory) and is a non-root directory in a revision newer than the directory timestamp computed by ListDirectoryMaxLeafTimestamp.
 - local_export_path = <luigi.parameter.PathParameter object>
 - local_graph_path = <luigi.parameter.PathParameter object>
 - graph_name = <luigi.parameter.Parameter object>
 - provenance_dir = <luigi.parameter.PathParameter object>
 - provenance_node_filter = <luigi.parameter.Parameter object>
 - max_ram_mb = <luigi.parameter.IntParameter object>
 - property resources
- Returns the value of self.max_ram_mb.
 - requires() Dict[str, Task]
- Returns LocalGraph and ListDirectoryMaxLeafTimestamp instances.
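The three conditions in the frontier definition above can be written as a small predicate. This is a minimal sketch, assuming a hypothetical `ToyDirectory` record and an `is_frontier` helper that are not part of swh-provenance; it also assumes timestamps are plain integers and that the per-directory timestamp is already available.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ToyDirectory:
    """Hypothetical in-memory stand-in for a directory node."""

    entries: Dict[str, str]  # entry name -> "file" or "dir"
    max_leaf_timestamp: int  # assumed to come from ListDirectoryMaxLeafTimestamp


def is_frontier(directory: ToyDirectory, revision_date: int, is_root: bool) -> bool:
    """Check the three conditions of the definition, relative to one revision."""
    has_direct_file = any(kind == "file" for kind in directory.entries.values())
    return (
        has_direct_file
        and not is_root
        and revision_date > directory.max_leaf_timestamp
    )


src = ToyDirectory(entries={"main.c": "file", "vendor": "dir"}, max_leaf_timestamp=100)
only_dirs = ToyDirectory(entries={"a": "dir"}, max_leaf_timestamp=100)

print(is_frontier(src, revision_date=150, is_root=False))        # True
print(is_frontier(src, revision_date=150, is_root=True))         # False: root directory
print(is_frontier(src, revision_date=100, is_root=False))        # False: revision not newer
print(is_frontier(only_dirs, revision_date=150, is_root=False))  # False: no direct file
```

Note that the predicate takes the revision date as an argument: a directory is frontier only relative to a given revision.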
 
- class swh.provenance.luigi.ListFrontierDirectoriesInRevisions(*args, **kwargs)
- Bases: Task
- Creates a file that contains the list of revisions any “frontier directory” (as defined by swh-provenance) is in.
- While a directory is considered frontier only relative to a revision, the produced file contains the list of all revisions a directory is in, for directories which are frontier for any revision.
 - local_export_path = <luigi.parameter.PathParameter object>
 - local_graph_path = <luigi.parameter.PathParameter object>
 - graph_name = <luigi.parameter.Parameter object>
 - provenance_dir = <luigi.parameter.PathParameter object>
 - provenance_node_filter = <luigi.parameter.Parameter object>
 - max_ram_mb = <luigi.parameter.IntParameter object>
 - property resources
- Returns the value of self.max_ram_mb.
 - requires() Dict[str, Task]
- Returns LocalGraph and ComputeDirectoryFrontier instances.
 
- class swh.provenance.luigi.ListContentsInRevisionsWithoutFrontier(*args, **kwargs)
- Bases: Task
- Creates a file that contains the list of (file, revision) pairs where the file is reachable from the revision without going through any “directory frontier” as defined by swh-provenance.
- In short, a directory frontier is a directory that directly contains a file (not a directory) and is a non-root directory in a revision newer than the directory timestamp computed by ListDirectoryMaxLeafTimestamp.
 - local_export_path = <luigi.parameter.PathParameter object>
 - local_graph_path = <luigi.parameter.PathParameter object>
 - graph_name = <luigi.parameter.Parameter object>
 - provenance_dir = <luigi.parameter.PathParameter object>
 - provenance_node_filter = <luigi.parameter.Parameter object>
 - max_ram_mb = <luigi.parameter.IntParameter object>
 - property resources
- Returns the value of self.max_ram_mb.
 - requires() Dict[str, Task]
- Returns LocalGraph and ListDirectoryMaxLeafTimestamp instances.
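A traversal that stops at frontier directories might look like the following. It is a sketch only, assuming a hypothetical toy tree (`DIR_ENTRIES`, `FILES`) and a precomputed set of frontier directories for the revision; the real task operates on the compressed graph and its output format is not shown here.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy tree: directory name -> names of its direct entries.
DIR_ENTRIES: Dict[str, List[str]] = {
    "root": ["README", "src"],
    "src": ["main.c", "vendor"],
    "vendor": ["lib.c"],
}
# Hypothetical set of entries that are files (contents).
FILES: Set[str] = {"README", "main.c", "lib.c"}


def contents_without_frontier(
    revision: str, root: str, frontier: Set[str]
) -> List[Tuple[str, str]]:
    """List (file, revision) pairs reachable without entering a frontier directory."""
    pairs: List[Tuple[str, str]] = []
    stack = [root]
    visited: Set[str] = set()
    while stack:
        directory = stack.pop()
        if directory in visited:
            continue
        visited.add(directory)
        for entry in DIR_ENTRIES.get(directory, []):
            if entry in FILES:
                pairs.append((entry, revision))
            elif entry not in frontier:
                # Only descend into directories that are not frontier.
                stack.append(entry)
    return sorted(pairs)


pairs = contents_without_frontier("rev1", "root", frontier={"vendor"})
print(pairs)
```

With `vendor` marked as frontier, `lib.c` is not listed for `rev1`; in the pipeline described on this page, such files would instead be covered by ListContentsInFrontierDirectories together with ListFrontierDirectoriesInRevisions.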
 
- class swh.provenance.luigi.ListContentsInFrontierDirectories(*args, **kwargs)
- Bases: Task
- Enumerates all contents in all directories returned by ComputeDirectoryFrontier.
 - local_export_path = <luigi.parameter.PathParameter object>
 - local_graph_path = <luigi.parameter.PathParameter object>
 - graph_name = <luigi.parameter.Parameter object>
 - provenance_dir = <luigi.parameter.PathParameter object>
 - provenance_node_filter = <luigi.parameter.Parameter object>
 - max_ram_mb = <luigi.parameter.IntParameter object>
 - property resources
- Returns the value of self.max_ram_mb.
 - requires() Dict[str, Task]
- Returns LocalGraph and ComputeDirectoryFrontier instances.
 
- class swh.provenance.luigi.ListRevisionsInOrigins(*args, **kwargs)
- Bases: Task
- Enumerates all revisions (as selected by the provenance_node_filter) in all origins.
 - local_export_path = <luigi.parameter.PathParameter object>
 - local_graph_path = <luigi.parameter.PathParameter object>
 - graph_name = <luigi.parameter.Parameter object>
 - provenance_dir = <luigi.parameter.PathParameter object>
 - provenance_node_filter = <luigi.parameter.Parameter object>
 - property resources
- Returns the value of self.max_ram_mb.
 
- class swh.provenance.luigi.UploadProvenanceDatabase(*args, **kwargs)
- Bases: _ParquetToS3Task
- Uploads to S3 the results of:
 * ListProvenanceNodes,
 * ListContentsInFrontierDirectories,
 * ListContentsInRevisionsWithoutFrontier,
 * ListFrontierDirectoriesInRevisions, and
 * ListRevisionsInOrigins.
 - local_export_path = <luigi.parameter.PathParameter object>
 - local_graph_path = <luigi.parameter.PathParameter object>
 - dataset_name = <luigi.parameter.Parameter object>
 - graph_name = <luigi.parameter.Parameter object>
 - provenance_dir = <luigi.parameter.PathParameter object>
 - provenance_node_filter = <luigi.parameter.Parameter object>
 - max_ram_mb = <luigi.parameter.IntParameter object>
 
- class swh.provenance.luigi.RunProvenance(*args, **kwargs)
- Bases: WrapperTask
- (Transitively) depends on all provenance tasks.
 - local_export_path = <luigi.parameter.PathParameter object>
 - local_graph_path = <luigi.parameter.PathParameter object>
 - dataset_name = <luigi.parameter.Parameter object>
 - graph_name = <luigi.parameter.Parameter object>
 - provenance_dir = <luigi.parameter.PathParameter object>
 - provenance_node_filter = <luigi.parameter.Parameter object>
 - max_ram_mb = <luigi.parameter.IntParameter object>
 - requires()
- Returns ListContentsInFrontierDirectories and ListContentsInRevisionsWithoutFrontier.
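The requires() entries documented on this page imply an execution order, which can be derived with the standard library alone. The sketch below does not import Luigi or swh.provenance; the `DEPENDENCIES` mapping is transcribed by hand from the requires() descriptions above, so tasks whose requirements are not documented here (for example UploadProvenanceDatabase) are left out. It assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Task name -> names of the tasks it requires, as documented on this page.
DEPENDENCIES = {
    "ListDirectoryMaxLeafTimestamp": {"LocalGraph", "ComputeEarliestTimestamps"},
    "ComputeDirectoryFrontier": {"LocalGraph", "ListDirectoryMaxLeafTimestamp"},
    "ListFrontierDirectoriesInRevisions": {"LocalGraph", "ComputeDirectoryFrontier"},
    "ListContentsInRevisionsWithoutFrontier": {
        "LocalGraph",
        "ListDirectoryMaxLeafTimestamp",
    },
    "ListContentsInFrontierDirectories": {"LocalGraph", "ComputeDirectoryFrontier"},
    "RunProvenance": {
        "ListContentsInFrontierDirectories",
        "ListContentsInRevisionsWithoutFrontier",
    },
}

# static_order() yields every task after all of the tasks it requires.
order = list(TopologicalSorter(DEPENDENCIES).static_order())
for name in order:
    print(name)
```

Several valid orders exist, so the exact output may vary; the only guarantee is that each task appears after everything it requires.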