Software Heritage - Indexer
Tools to compute multiple indexes on SWH’s raw contents:
- content:
  - mimetype
  - fossology-license
  - metadata
- origin:
  - metadata (intrinsic, using the content indexer; and extrinsic)

An indexer is in charge of:
- looking up objects 
- extracting information from those objects 
- storing that information in the swh-indexer db
There are multiple indexers working on different object types:
- content indexer: works with content sha1 hashes
- revision indexer: works with revision sha1 hashes
- origin indexer: works with origin identifiers
Indexation procedure:
- receive a batch of ids
- retrieve the associated data, depending on the object type
- compute some index for that object
- store the result in swh’s storage
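The procedure above can be sketched as a small loop. This is a minimal illustration, not the swh-indexer implementation: `index_batch`, `line_count`, and the in-memory `objects` and `index_db` stand-ins are hypothetical names introduced here, and the "index" computed is a toy line count.

```python
import hashlib
from typing import Callable, Iterable, List, Optional


def index_batch(
    ids: Iterable[bytes],
    fetch: Callable[[bytes], Optional[bytes]],
    compute: Callable[[bytes], dict],
    store: Callable[[List[dict]], None],
) -> int:
    """Run one indexation pass over a batch of ids (hypothetical helper)."""
    results = []
    for obj_id in ids:
        data = fetch(obj_id)  # retrieve the associated data
        if data is None:      # unknown object: nothing to index
            continue
        # compute some index for that object and keep it with its id
        results.append({"id": obj_id, **compute(data)})
    store(results)            # store the whole batch of results at once
    return len(results)


# In-memory stand-ins (assumptions) for the object storage and the index db.
objects = {hashlib.sha1(b).digest(): b for b in (b"hello\n", b"")}
index_db: List[dict] = []


def line_count(data: bytes) -> dict:
    """Toy index: number of newline characters in the content."""
    return {"lines": data.count(b"\n")}


missing_id = hashlib.sha1(b"not stored").digest()
stored = index_batch(
    list(objects) + [missing_id], objects.get, line_count, index_db.extend
)
```

Passing `fetch`, `compute`, and `store` as callables keeps the loop independent of the object type, which mirrors the idea that content, revision, and origin indexers share the same overall procedure.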
Current content indexers:
- mimetype (queue swh_indexer_content_mimetype): detect the encoding and mimetype 
- fossology-license (queue swh_indexer_fossology_license): compute the license 
- metadata: translate files from ecosystem-specific formats to JSON-LD (using the schema.org/CodeMeta vocabulary)
Current origin indexers:
- metadata: translate files from ecosystem-specific formats to JSON-LD (using the schema.org/CodeMeta and ForgeFed vocabularies)
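
To give an idea of what translating an ecosystem-specific file to JSON-LD involves, here is a minimal sketch for an npm `package.json`. It is an illustration under stated assumptions, not the mapping used by swh-indexer: `translate_package_json` and `NPM_TO_CODEMETA` are hypothetical names, and the mapping is deliberately partial.

```python
import json

# JSON-LD context published for CodeMeta 2.0.
CODEMETA_CONTEXT = "https://doi.org/10.5063/schema/codemeta-2.0"

# Hypothetical, deliberately partial mapping from package.json keys
# to CodeMeta terms; a real translation covers many more fields.
NPM_TO_CODEMETA = {
    "name": "name",
    "version": "version",
    "description": "description",
    "keywords": "keywords",
}


def translate_package_json(raw: bytes) -> dict:
    """Translate the raw bytes of a package.json into a JSON-LD dict."""
    package = json.loads(raw)
    doc = {"@context": CODEMETA_CONTEXT, "@type": "SoftwareSourceCode"}
    for npm_key, codemeta_key in NPM_TO_CODEMETA.items():
        if npm_key in package:
            doc[codemeta_key] = package[npm_key]
    # "repository" may be a plain string or an object with a "url" key.
    repository = package.get("repository")
    if isinstance(repository, dict):
        repository = repository.get("url")
    if repository:
        doc["codeRepository"] = repository
    return doc


sample = (
    b'{"name": "example", "version": "1.0.0", '
    b'"repository": {"type": "git", "url": "https://example.org/example.git"}}'
)
translated = translate_package_json(sample)
```

Fields missing from the input are simply left out of the result, so the output only contains statements that the source file actually makes.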