Software Heritage GraphQL API#
This repository holds the development of the Software Heritage GraphQL API. The service is available at https://archive.softwareheritage.org/graphql/. A staging version of this service is available at https://graphql.staging.swh.network.
Running locally#
Refer to https://docs.softwareheritage.org/devel/getting-started.html#getting-started for running Software Heritage services locally.
If you wish to run SWH-GraphQL independently and have access to SWH storage services, the following make targets can be used.
- make run-dev: Use the config file at swh/graphql/config/dev.yml and start the service in auto-reload mode using uvicorn.
- make run-dev-stable: Use the config file at swh/graphql/config/dev.yml and start the service using uvicorn.
- make run-dev-docker: Run the service inside a docker container and use the config file at swh/graphql/config/dev.yml.
- make run-wsgi-docker: Run the service inside a docker container and use the config file at swh/graphql/config/staging.yml.
- Visit http://localhost:8000 to use the query explorer.
Running a query#
The easiest way to run a query is to use the query explorer at https://archive.softwareheritage.org/graphql/. Please log in using your SWH credentials if you wish to run bigger queries.
Using curl#
curl -i -H 'Content-Type: application/json' -H "Authorization: bearer your-api-token" -X POST -d '{"query": "query {origins(first: 2) {nodes {url}}}"}' https://archive.softwareheritage.org/graphql/
Using Python requests#
import requests
url = "https://archive.softwareheritage.org/graphql/"
query = """
{
  origins(first: 2) {
    pageInfo {
      hasNextPage
      endCursor
    }
    edges {
      node {
        url
      }
    }
  }
}
"""
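# Named "payload" to avoid shadowing the standard json module.
payload = {"query": query}
headers = {}
# To authenticate, uncomment the two lines below and set your API token.
# api_token = "your-api-token"
# headers = {"Authorization": f"Bearer {api_token}"}
r = requests.post(url=url, json=payload, headers=headers)
# Fail early on HTTP-level errors before decoding the body.
r.raise_for_status()
print(r.json())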
Pagination#
The server can return either a Node object (e.g. the Origin type) or a Connection object. All connection objects can be paginated using cursors.
All the entrypoints that return a Connection (e.g. the origins entrypoint, which returns an OriginConnection type) accept the following arguments.
- first: An integer. The number of objects to return, a.k.a. the page size. This is a mandatory argument for most of the connections. The maximum value of first is limited to 1000. There are some entrypoints where the first argument is not mandatory (e.g. statuses inside a Visit type). The default value of first is set to 50 in those cases.
- after: A string. The cursor to be used for pagination. If no cursor is given, the server will return the first `first` objects from the beginning.
Every connection type will have the following fields.
- edges: A list of objects with the following fields.
  - node: The requested SWH object.
  - cursor: Cursor to the specific object (item cursor). This field is not available in all connections for the time being. This cursor can be used to paginate starting from this particular object.
- nodes: A list of SWH objects. This is a shortcut to access the SWH objects without going through the edges layer, but it is not possible to get an item cursor using nodes.
- pageInfo: Data to be used for querying subsequent pages. Contains the following fields.
  - endCursor: Cursor to request the next page.
  - hasNextPage: A boolean value.
- totalCount: Total number of objects available in the connection after applying the given filters. This is not available for many connections for the time being.
Example for pagination using edges#
Get the contents of a directory
query getDirectoryContent {
  directory(swhid: "swh:1:dir:b0b6050efa0634ecded8508a7ab9c6774ca69ac8") {
    entries(first: 5, after: "NQ==") {
      totalCount
      edges {
        node {
          name {
            text
          }
        }
        cursor
      }
      pageInfo {
        endCursor
        hasNextPage
      }
    }
  }
}
Example for pagination using nodes#
query getDirectoryContent {
  directory(swhid: "swh:1:dir:b0b6050efa0634ecded8508a7ab9c6774ca69ac8") {
    entries(first: 2, after: "NTA=") {
      totalCount
      nodes {
        name {
          text
        }
      }
      pageInfo {
        endCursor
        hasNextPage
      }
    }
  }
}
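The two queries above fetch a single page each. A minimal sketch of a client-side loop that follows endCursor until hasNextPage is false is given below. The variable declarations in the query (`$first: Int!`, `$after: String`) and the helper names `http_execute` and `iter_origin_urls` are assumptions made for illustration; they are not part of this repository.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, Iterator, Optional

API_URL = "https://archive.softwareheritage.org/graphql/"

# Assumption: the origins entrypoint accepts these variable types.
ORIGINS_QUERY = """
query getOrigins($first: Int!, $after: String) {
  origins(first: $first, after: $after) {
    nodes { url }
    pageInfo { endCursor hasNextPage }
  }
}
"""

Executor = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def http_execute(query: str, variables: Dict[str, Any]) -> Dict[str, Any]:
    """POST a query to the public endpoint and return the "data" member."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    request = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["data"]


def iter_origin_urls(
    execute: Executor, page_size: int = 50, max_pages: Optional[int] = None
) -> Iterator[str]:
    """Yield origin URLs page by page, following endCursor."""
    after: Optional[str] = None  # no cursor: start from the beginning
    pages = 0
    while max_pages is None or pages < max_pages:
        data = execute(ORIGINS_QUERY, {"first": page_size, "after": after})
        connection = data["origins"]
        for node in connection["nodes"]:
            yield node["url"]
        pages += 1
        page_info = connection["pageInfo"]
        if not page_info["hasNextPage"]:
            break
        after = page_info["endCursor"]
```

Calling `iter_origin_urls(http_execute, page_size=100, max_pages=3)` would request at most three pages. Passing the executor as an argument keeps the loop testable without network access.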
Design#
Architecture#
- At a high level, a multitier pattern is used: the schema represents the user-facing part, resolvers act as the controller, and backends fetch the data.
- Starlette is used as the application framework.
- Ariadne, a schema-first Python library, is used. Ariadne is built on top of graphql-core. The library is used only for binding resolvers to the schema and for some simple actions like static cost calculation. This is not a hard dependency and can be replaced if needed.
Schema#
- The schema is written in SDL.
- The schema follows the Relay specification, using Nodes and Connections.
- Naming: lower camelCase is used for fields and input arguments, CamelCase for types and enums.
Resolvers#
- Inherit resolvers.base_node.BaseNode to create a node resolver.
  - Use this to return a single object, e.g. Origin.
  - Override _get_node_data to return the real data. This can return either an object or a map.
  - Override _get_node_from_data in case you want to format the data before returning.
- Inherit resolvers.base_connection.BaseConnection to create a connection resolver.
  - Use this to return a paginated list, e.g. Origins.
  - The pagination response will be automatically created with edges and PageInfo, as per the Relay specification.
  - Override _get_connection_data to return the real data.
  - Handlers are available to read general user arguments like first and after, and can be overridden for the specific case.
- Inherit resolvers.base_connection.BaseList to create a simple list resolver.
  - Use this to return a non-paginated simple list, e.g. resolveSWHID. (This returns a simple list, instead of a node, to handle the rare case of SWHID collisions.)
  - Override _get_results to return the real data.
- Others
  - Binary string - To return an object with a string and its binary value.
  - Date with offset - To return an object with an isoformat date and its offset as a Binary string.
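The node resolver pattern can be illustrated with a small, self-contained sketch. The `BaseNode` class below is a simplified stand-in written for this example, not the actual resolvers.base_node.BaseNode; `FAKE_ARCHIVE`, `OriginNode` and the `node` attribute are hypothetical. Only the two hook names, `_get_node_data` and `_get_node_from_data`, come from the description above.

```python
from typing import Any, Dict, Optional


class ObjectNotFoundError(Exception):
    """Stand-in for the client error raised when a node is missing."""


class BaseNode:
    """Simplified stand-in for a node resolver base class (illustration only)."""

    def __init__(self, **kwargs: Any) -> None:
        # User arguments are kept untyped in kwargs.
        self.kwargs = kwargs
        data = self._get_node_data()
        if data is None:
            raise ObjectNotFoundError("Requested object is not available")
        self.node = self._get_node_from_data(data)

    def _get_node_data(self) -> Optional[Any]:
        """Return the raw data for the node; subclasses must override this."""
        raise NotImplementedError

    def _get_node_from_data(self, node_data: Any) -> Any:
        """Optionally reformat the raw data; the default returns it unchanged."""
        return node_data


# Hypothetical in-memory backend used in place of a storage call.
FAKE_ARCHIVE: Dict[str, Dict[str, Any]] = {
    "https://example.org/repo.git": {
        "url": "https://example.org/repo.git",
        "id": b"\x01\x02",
    }
}


class OriginNode(BaseNode):
    def _get_node_data(self) -> Optional[Dict[str, Any]]:
        return FAKE_ARCHIVE.get(self.kwargs.get("url", ""))

    def _get_node_from_data(self, node_data: Dict[str, Any]) -> Dict[str, Any]:
        # Format the binary id as hex before returning it to the client.
        return {"url": node_data["url"], "id": node_data["id"].hex()}
```

In this sketch a missing object surfaces as `ObjectNotFoundError` at construction time, which mirrors the statement that only node resolvers raise that error.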
 
Resolver factory#
resolvers.resolvers module, which is generally a lot more complex.
- The snapshot-headbranch key is handled by the SnapshotHeadBranchNode class.
- revision-parents (parents inside a revision) is handled by the ParentRevisionConnection class.
- directory-directoryentry (a directory-entry from a directory) is handled by the DirEntryInDirectoryNode class.
Custom Scalars#
A few custom scalars are defined here.
- SWHID - Serializer and parser for SWHID 
- ID - A string used as cache key for JS clients 
- DateTime - Serializer for Datetime objects 
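As an illustration of what a serializer and parser pair for the SWHID scalar has to do, here is a minimal sketch in plain Python. It handles only core identifiers of the form `swh:1:<type>:<40 hex digits>` and ignores qualifiers. The names `SWHID`, `parse_swhid` and `serialize_swhid` are made up for this example and are not the functions used by the service.

```python
import re
from typing import NamedTuple

# Core SWHID: scheme, version 1, object type, 40 lowercase hex digits.
SWHID_RE = re.compile(r"swh:(?P<version>1):(?P<type>cnt|dir|rev|rel|snp):(?P<hash>[0-9a-f]{40})")


class SWHID(NamedTuple):
    version: int
    object_type: str
    object_id: bytes


def parse_swhid(value: str) -> SWHID:
    """Parse a client-supplied string; raise ValueError when it is malformed."""
    match = SWHID_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"Invalid SWHID: {value!r}")
    return SWHID(int(match["version"]), match["type"], bytes.fromhex(match["hash"]))


def serialize_swhid(swhid: SWHID) -> str:
    """Render the identifier back to its string form for the response."""
    return f"swh:{swhid.version}:{swhid.object_type}:{swhid.object_id.hex()}"
```

A malformed value raising `ValueError` at parse time corresponds to the kind of validation failure that the service reports as an invalid input.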
Targets and union types#
parentObject {
  ...
  target {
    ...
    type
    SWHID   // This could be any identifier (can be a hash)
    node {  // The real object
      ..
    }
  }
}
Errors#
Client errors not reported in Sentry are#
- ObjectNotFoundError - Requested object is missing. Only node resolvers will raise this. (Similar to a 404 error). 
- PaginationError - Issue with pagination arguments, such as an invalid cursor or a first argument that is too big. This is a validation error.
- InvalidInputError - Issues like invalid SWHID, or an invalid sort order from a client. This is a validation error. 
- QuerySyntaxError - Error in the client query, caught by Ariadne while parsing. This is a validation error.
Other possible errors (reported in Sentry)#
- DataError 
- Errors related to authentication 
- Unhandled errors 
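On the client side, these errors arrive in the standard GraphQL response envelope, where failures are listed under an `errors` key with a `message` per entry. A minimal sketch of unwrapping such a response is shown below. `GraphQLRequestError` and `unwrap_response` are hypothetical helper names, and the sample message text used with them is invented.

```python
from typing import Any, Dict, List


class GraphQLRequestError(Exception):
    """Raised when a GraphQL response carries one or more errors."""

    def __init__(self, messages: List[str]) -> None:
        super().__init__("; ".join(messages))
        self.messages = messages


def unwrap_response(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return the "data" member, or raise if the server reported errors."""
    errors = payload.get("errors") or []
    if errors:
        raise GraphQLRequestError(
            [error.get("message", "unknown error") for error in errors]
        )
    return payload.get("data") or {}
```

This sketch treats any error as fatal. A response may carry partial data alongside errors, so a client that wants to use that data would need a more lenient policy.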
Backends#
- Archive - All the calls to swh-storage 
- Search - All the calls to swh-search 
Middlewares#
- LogMiddleware - Used to send statsd metrics
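A middleware of this kind can be sketched as a plain ASGI wrapper that times each HTTP request and hands the duration to a metrics callback. This is not the LogMiddleware from this repository: the class name `TimingMiddleware`, the `record` callback and the metric name are assumptions, and no statsd client is involved.

```python
import time
from typing import Any, Awaitable, Callable, Dict

Scope = Dict[str, Any]
ASGIApp = Callable[[Scope, Callable[..., Awaitable[Any]], Callable[..., Awaitable[Any]]], Awaitable[None]]
Record = Callable[[str, float], None]


class TimingMiddleware:
    """Pure ASGI middleware that reports the response time of HTTP requests."""

    def __init__(self, app: ASGIApp, record: Record) -> None:
        self.app = app
        self.record = record  # e.g. a thin wrapper around a statsd timing call

    async def __call__(self, scope: Scope, receive: Any, send: Any) -> None:
        if scope["type"] != "http":
            # Pass lifespan and websocket traffic through untouched.
            await self.app(scope, receive, send)
            return
        start = time.monotonic()
        try:
            await self.app(scope, receive, send)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            # Hypothetical metric name, chosen for this example.
            self.record("graphql_response_time_ms", elapsed_ms)
```

Because it relies only on the ASGI calling convention, a wrapper like this could be placed around a Starlette application without subclassing anything.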
Local paginations#
eg: DirectoryEntry
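Local pagination presumably means that the service fetches the full list from the backend and slices it itself. The sample cursors used in the pagination examples (`NQ==` and `NTA=`) are the base64 encodings of "5" and "50", which suggests offset-based cursors. The sketch below assumes that encoding; the function names are hypothetical and the real implementation may differ.

```python
import base64
from typing import Any, Dict, Optional, Sequence


def encode_cursor(offset: int) -> str:
    """Encode a list offset as an opaque base64 cursor."""
    return base64.b64encode(str(offset).encode("ascii")).decode("ascii")


def decode_cursor(cursor: Optional[str]) -> int:
    """Decode a cursor back to an offset; no cursor means the beginning."""
    if cursor is None:
        return 0
    return int(base64.b64decode(cursor).decode("ascii"))


def paginate_locally(
    items: Sequence[Any], first: int, after: Optional[str] = None
) -> Dict[str, Any]:
    """Slice an already fetched list into one connection-shaped page."""
    start = decode_cursor(after)
    page = list(items[start:start + first])
    end = start + len(page)
    has_next = end < len(items)
    return {
        "nodes": page,
        "pageInfo": {
            "endCursor": encode_cursor(end) if has_next else None,
            "hasNextPage": has_next,
        },
        "totalCount": len(items),
    }
```

The cost of this approach is that every page request still loads the whole list, which is why moving pagination to the backend is preferable for large collections.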
Tests#
Add a new Scalar field inside a type#
- Add the field in the schema 
- Add the cost associated. (Ideally this should be 0 for all scalars) 
- Find the resolver class from resolvers.resolvers.py. (This step can be skipped in most cases by directly checking the factory dict.)
- Add a field either in the backend response or as a property in the resolver class. 
- If the field involves a new backend call or any extra computing, add it as a new type instead of a field, by following the steps below.
Add a new type field inside another type#
- Add the type in the schema 
- Add a field along with arguments in the parent type connecting to the newly added type. 
- Add the cost associated. Multipliers can be used if needed. 
- Add the resolver class using the right base class and override the required function/s. 
- Add a backend function (if necessary) to fetch data 
- Connect the route in resolvers.resolvers.py.
- Bind the class in the resolver factory, resolvers.resolver_factory.py.
- You have to add the type in app.py, in case you have sub fields to resolve.
- e.g. the MR to add an entry field in the Directory object. It is created as a new resolver object as it has a cost.
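The factory binding step can be pictured as a dictionary lookup from a parent-field key to a resolver class. The sketch below uses the key and class names mentioned in the Resolver factory section, but the class bodies are empty placeholders and `get_resolver_class` is a hypothetical helper. The real resolvers.resolver_factory module may be organised differently.

```python
from typing import Dict, Type


class BaseNode:
    """Placeholder for the node resolver base class."""


class BaseConnection:
    """Placeholder for the connection resolver base class."""


class SnapshotHeadBranchNode(BaseNode):
    pass


class ParentRevisionConnection(BaseConnection):
    pass


class DirEntryInDirectoryNode(BaseNode):
    pass


# Keys follow the "<parent type>-<field>" pattern used in the examples.
RESOLVERS: Dict[str, Type[object]] = {
    "snapshot-headbranch": SnapshotHeadBranchNode,
    "revision-parents": ParentRevisionConnection,
    "directory-directoryentry": DirEntryInDirectoryNode,
}


def get_resolver_class(key: str) -> Type[object]:
    """Return the resolver class bound to a key, or fail loudly."""
    try:
        return RESOLVERS[key]
    except KeyError:
        raise LookupError(f"No resolver registered for {key!r}") from None
```

With this layout, adding a new type amounts to writing the resolver class and adding one entry to the mapping.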
Add a new entrypoint#
- This is the same as adding a new type inside another type. The parent type will be the root (query) in this case.
Cost calculator#
- Static and calculated by Ariadne. This check is executed before running the query. It may not be a good idea to use this to calculate credits, as it always assumes the maximum possible cost of a query.
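To see why a static cost is always an upper bound, consider a toy model in which each field has a fixed complexity and a connection multiplies its own cost, plus that of its children, by the requested page size. This is an illustrative model only, not Ariadne's algorithm. `FieldCost`, `max_cost` and the numbers used are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FieldCost:
    """One field in a query tree, with an assumed cost configuration."""

    name: str
    complexity: int = 0   # fixed cost of resolving the field once
    multiplier: int = 1   # e.g. the value of the "first" argument
    children: List["FieldCost"] = field(default_factory=list)


def max_cost(node: FieldCost) -> int:
    """Worst-case cost: assume every page is full at every level."""
    child_total = sum(max_cost(child) for child in node.children)
    return node.multiplier * (node.complexity + child_total)


# origins(first: 10) { url visits(first: 5) { date } }
query_tree = FieldCost(
    "origins",
    complexity=1,
    multiplier=10,
    children=[
        FieldCost("url"),  # scalar, cost 0
        FieldCost("visits", complexity=1, multiplier=5, children=[FieldCost("date")]),
    ],
)
```

Here `max_cost(query_tree)` evaluates to 10 × (1 + 5 × 1) = 60 even if the archive holds only one matching origin with one visit, which illustrates why such a figure overestimates the work actually done.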
Client#
- GraphiQL based, and returned from the GraphQL server itself.
Future works#
- Indexers - Add an indexer backend 
- Metadata - Add a metadata backend. The major issue here is that the metadata is not well structured. It is not very helpful to return raw JSON.
- Disable CORS - CORS is enabled for all the domains now. Limit it to legitimate clients only.
- More backends - Graph backend, Provenance backend 
- More fields - More fields can be added with more backends, e.g. a 'firstSeen' field in a content object can be added from provenance.
- Mutations - Write APIs 
- Dynamic query cost calculator and partial response - To calculate the exact query cost. It is also possible to return part of the response depending on the cost.
- Advanced rate limiter - Maybe as a different service. To support user level query quota and time based restrictions. 
- De-couple client and server - The client UI is returned by the same service. It is a good idea to make them separate. A simple independent client is available; swh-client is a basic working copy.
- Make fully Relay compliant - Missing startCursor and itemCursor in pagination. totalCount (not in the Relay spec) is missing from most of the connections.
- Sentry transactions - Create a transaction per query and add all the objects. 
- Backend/performance improvements by writing new queries with joins - Write bigger backend queries to avoid multiple storage hits. 
- Add type to resolver arguments - User inputs are not typed now and are available in self.kwargs. Add types to each of the inputs 
- Remove local paginations - Move all the pagination to the backend 
- Cache - Ideally in the storage level. 
- Address FIXMEs - Most of them are related to local pagination 
- Make resolvers asynchronous - Could improve performance in case a query requests multiple types in a single request.