Software Heritage GraphQL API#
This repository holds the development of the Software Heritage GraphQL API. The service is available at https://archive.softwareheritage.org/graphql/. A staging version of this service is available at https://graphql.staging.swh.network.
Running locally#
Refer to https://docs.softwareheritage.org/devel/getting-started.html#getting-started for running Software Heritage services locally.
If you wish to run SWH-GraphQL independently and have access to SWH storage services, the following make targets can be used.
make run-dev: Use the config file at swh/graphql/config/dev.yml and start the service in auto-reload mode using uvicorn.
make run-dev-stable: Use the config file at swh/graphql/config/dev.yml and start the service using uvicorn.
make run-dev-docker: Run the service inside a docker container and use the config file at swh/graphql/config/dev.yml.
make run-wsgi-docker: Run the service inside a docker container and use the config file at swh/graphql/config/staging.yml.
Visit http://localhost:8000 to use the query explorer.
Running a query#
The easiest way to run a query is to use the query explorer at https://archive.softwareheritage.org/graphql/. Please log in using your SWH credentials if you wish to run bigger queries.
Using curl#
curl -i -H 'Content-Type: application/json' -H "Authorization: bearer your-api-token" -X POST -d '{"query": "query {origins(first: 2) {nodes {url}}}"}' https://archive.softwareheritage.org/graphql/
Using Python requests#
import requests

url = "https://archive.softwareheritage.org/graphql/"
query = """
{
  origins(first: 2) {
    pageInfo {
      hasNextPage
      endCursor
    }
    edges {
      node {
        url
      }
    }
  }
}
"""
payload = {"query": query}
headers = {}
# To authenticate, uncomment the two lines below and set your API token.
# api_token = "your-api-token"
# headers = {"Authorization": f"Bearer {api_token}"}
r = requests.post(url=url, json=payload, headers=headers)
print(r.json())
Pagination#
The server can return either a Node object (eg: Origin type) or a Connection object. All the connection objects can be paginated using cursors.
All the entrypoints that return a Connection (eg: origins entrypoint that returns an OriginConnection type) will accept the following arguments.
first: An integer. The number of objects to return, a.k.a. the page size. This is a mandatory argument for most of the connections. The maximum value of first is limited to 1000. There are some entrypoints where the first argument is not mandatory (eg: statuses inside a Visit type). The default value of first will be set to 50 in those cases.
after: A string. The cursor to be used for pagination. If no cursor is given, the server will return the number of objects given by first, starting from the beginning.
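The argument rules above can be mirrored on the client side before a request is sent. The following is a minimal sketch and not part of the service: `connection_args`, `MAX_FIRST` and `DEFAULT_FIRST` are hypothetical names, the values 1000 and 50 come from the description above, and the lower bound of 1 is an assumption.

```python
from typing import Optional

MAX_FIRST = 1000     # documented upper bound for `first`
DEFAULT_FIRST = 50   # documented default where `first` is optional


def connection_args(first: Optional[int] = None,
                    after: Optional[str] = None,
                    first_is_mandatory: bool = True) -> dict:
    """Build the pagination arguments for a connection (hypothetical helper)."""
    if first is None:
        if first_is_mandatory:
            raise ValueError("`first` is mandatory for this connection")
        first = DEFAULT_FIRST
    # The lower bound of 1 is an assumption; only the maximum is documented.
    if not 1 <= first <= MAX_FIRST:
        raise ValueError(f"`first` must be between 1 and {MAX_FIRST}")
    args = {"first": first}
    if after is not None:
        # Without a cursor the server starts from the beginning.
        args["after"] = after
    return args


print(connection_args(first=2))
print(connection_args(first_is_mandatory=False, after="NQ=="))
```

The server remains the authority on these limits; a check like this only avoids a round trip for requests that would be rejected anyway.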
Every connection type will have the following fields.
edges: This will be a list of objects with the following fields.
  node: The requested SWH object.
  cursor: Cursor to the specific object (item cursor). This field is not available in all connections for the time being. This cursor can be used to paginate starting from this particular object.
nodes: A list of SWH objects. This is a shortcut to access the SWH objects without going through the edges layer, but it is not possible to get an item cursor using nodes.
pageInfo: Data to be used for querying subsequent pages. Contains the following fields.
  endCursor: Cursor to request the next page.
  hasNextPage: A boolean value indicating whether a next page is available.
totalCount: Total number of objects available in the connection after applying the given filters. This is not available for many connections for the time being.
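The fields above are enough to walk a whole connection page by page. The sketch below shows one way to do that, assuming a caller-supplied `fetch_page(after)` function that runs the query and returns the connection object. `iter_nodes`, `fetch_page` and `FAKE_PAGES` are hypothetical names, and the data is made up so the example runs without network access.

```python
from typing import Callable, Iterator, Optional


def iter_nodes(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every node of a connection, following pageInfo.endCursor.

    `fetch_page(after)` is a caller-supplied function (hypothetical) that runs
    the query with the given cursor and returns the connection object, i.e. a
    dict with `nodes` and `pageInfo` keys.
    """
    after = None  # no cursor: the server starts from the beginning
    while True:
        connection = fetch_page(after)
        yield from connection["nodes"]
        page_info = connection["pageInfo"]
        if not page_info["hasNextPage"]:
            break
        after = page_info["endCursor"]


# Stand-in for the server: two pages of made-up data keyed by cursor.
FAKE_PAGES = {
    None: {"nodes": [{"url": "a"}, {"url": "b"}],
           "pageInfo": {"endCursor": "c1", "hasNextPage": True}},
    "c1": {"nodes": [{"url": "c"}],
           "pageInfo": {"endCursor": None, "hasNextPage": False}},
}

urls = [node["url"] for node in iter_nodes(FAKE_PAGES.__getitem__)]
print(urls)
```

Against the real service, `fetch_page` would post the query with the cursor filled in (for example with requests) and return the connection found under `data` in the JSON response.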
Example for pagination using edges#
Get the contents of a directory
query getDirectoryContent {
  directory(swhid: "swh:1:dir:b0b6050efa0634ecded8508a7ab9c6774ca69ac8") {
    entries(first: 5, after: "NQ==") {
      totalCount
      edges {
        node {
          name {
            text
          }
        }
        cursor
      }
      pageInfo {
        endCursor
        hasNextPage
      }
    }
  }
}
Example for pagination using nodes#
query getDirectoryContent {
  directory(swhid: "swh:1:dir:b0b6050efa0634ecded8508a7ab9c6774ca69ac8") {
    entries(first: 2, after: "NTA=") {
      totalCount
      nodes {
        name {
          text
        }
      }
      pageInfo {
        endCursor
        hasNextPage
      }
    }
  }
}
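The two cursors used in these examples happen to be valid Base64 text, and decoding them gives `5` and `50`. This is only an observation about the sample values, not a documented guarantee of the cursor format, so clients should treat cursors as opaque strings and pass back whatever the server returns in endCursor.

```python
import base64

# Decode the sample cursors from the two queries above.
decoded = {cursor: base64.b64decode(cursor).decode("ascii")
           for cursor in ("NQ==", "NTA=")}
print(decoded)
```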
Design#
Architecture#
On a high level, a multitier pattern is used.
The schema represents the user-facing part, the resolvers act as the controller, and the backend fetches the data.
Starlette is used as the application framework.
Ariadne, a schema-first Python library, is used.
Ariadne is built on top of graphql-core.
The library is used only for binding resolvers to the schema and for some simple actions like static cost calculation.
This is not a hard dependency, and can be replaced if needed.
Schema#
The schema is written in SDL.
The schema follows the Relay specification, using Nodes and Connections.
Naming: lower camelCase is used for fields and input arguments, CamelCase for types and enums.
Resolvers#
- Inherit resolvers.base_node.BaseNode to create a node resolver.
  Use this to return a single object.
  eg: Origin
  Override _get_node_data to return the real data. This can return either an object or a map.
  Override _get_node_from_data in case you want to format the data before returning.
- Inherit resolvers.base_connection.BaseConnection to create a connection resolver.
  Use this to return a paginated list.
  The pagination response will be automatically created with edges and PageInfo in the response, as per the Relay specification.
  eg: Origins
  Override _get_connection_data to return the real data.
  Handlers are available to read general user arguments like first and after, and can be overridden for the specific case.
- Inherit resolvers.base_connection.BaseList to create a simple list resolver.
  Use this to return a non-paginated simple list.
  eg: resolveSWHID (This returns a simple list, instead of a node, to handle the rare case of SWHID collisions)
  Override _get_results to return the real data.
Others
Binary string
To return an object with a string and its binary value
Date with offset
To return an object with an isoformat date and its offset as a Binary string
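The node-resolver pattern described above can be illustrated with a small stand-alone sketch. The classes below are hypothetical stand-ins written for this example; they are not the real resolvers.base_node.BaseNode, and only the two hook names are taken from the text. The in-memory `FAKE_ORIGINS` dict replaces the backend.

```python
class ObjectNotFoundError(Exception):
    """Raised when the requested object is missing (similar to a 404)."""


class BaseNode:
    """Hypothetical base class: fetch raw data, then optionally format it."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs  # user inputs for this resolver
        data = self._get_node_data()
        if data is None:
            raise ObjectNotFoundError("Requested object is not available")
        self.node = self._get_node_from_data(data)

    def _get_node_data(self):
        # Subclasses return the real data, either an object or a map.
        raise NotImplementedError

    def _get_node_from_data(self, data):
        # Default: return the data unchanged.
        return data


# Made-up in-memory backend used only for this example.
FAKE_ORIGINS = {
    "https://example.org/repo": {"url": "https://example.org/repo", "id": 1},
}


class OriginNode(BaseNode):
    def _get_node_data(self):
        return FAKE_ORIGINS.get(self.kwargs["url"])

    def _get_node_from_data(self, data):
        # Keep only the field this toy schema exposes.
        return {"url": data["url"]}


print(OriginNode(url="https://example.org/repo").node)
```

A connection or list resolver would follow the same shape, with a different hook returning a sequence instead of a single object.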
Resolver factory#
resolvers.resolvers
module, which is generally a lot more complex.snapshot-headbranch
key
and handled by SnapshotHeadBranchNode
classrevision-parents
(parents inside a revision) is handled by ParentRevisionConnection
class.directory-directoryentry
(a directory-entry from a directory) is handled by DirEntryInDirectoryNode
class.Custom Scalars#
A few custom scalars are defined here.
SWHID
Serializer and parser for SWHID
ID
A string used as cache key for JS clients
DateTime
Serializer for Datetime objects
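As an illustration of what a serializer and parser pair for the SWHID scalar could look like, here is a minimal sketch. It is not the service's implementation: `parse_swhid` and `serialize_swhid` are hypothetical names, only core SWHIDs of the form swh:1:<type>:<40 hex digits> are handled, and qualifiers are ignored.

```python
import re

# Core SWHID shape: swh:1:<object type>:<40 lowercase hex digits>.
SWHID_RE = re.compile(r"^swh:1:(snp|rel|rev|dir|cnt):([0-9a-f]{40})$")


def parse_swhid(value: str) -> dict:
    """Parse a client-supplied string, rejecting anything malformed."""
    match = SWHID_RE.match(value)
    if match is None:
        raise ValueError(f"Invalid SWHID: {value!r}")
    return {"object_type": match.group(1), "object_id": match.group(2)}


def serialize_swhid(parsed: dict) -> str:
    """Turn the parsed form back into the string sent to clients."""
    return f"swh:1:{parsed['object_type']}:{parsed['object_id']}"


example = "swh:1:dir:b0b6050efa0634ecded8508a7ab9c6774ca69ac8"
parsed = parse_swhid(example)
print(parsed["object_type"], serialize_swhid(parsed) == example)
```

With Ariadne, functions like these would typically be attached to a ScalarType through its serializer and value_parser decorators; how the service actually wires them is not shown here.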
Targets and union types#
parentObject {
  ...
  target {
    ...
    type
    SWHID // This could be any identifier (can be a hash)
    node { // The real object
      ..
    }
  }
}
Errors#
Client errors (not reported in Sentry)#
ObjectNotFoundError
Requested object is missing. Only node resolvers will raise this. (Similar to a 404 error).
PaginationError
Issue with pagination arguments: an invalid cursor, a first argument that is too big, etc. This is a validation error.
InvalidInputError
Issues like invalid SWHID, or an invalid sort order from a client. This is a validation error.
QuerySyntaxError
Error in the client query, caught by Ariadne while parsing. This is a validation error.
Other possible errors (reported in Sentry)#
DataError
Errors related to authentication
Unhandled errors
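The split between the two groups of errors can be expressed as a simple type check. The sketch below is illustrative only: the exception classes are re-declared locally as stand-ins for the ones named above, and `should_report_to_sentry` is a hypothetical helper, not a function of the service or of the Sentry SDK.

```python
class ObjectNotFoundError(Exception):
    """Requested object is missing."""


class PaginationError(Exception):
    """Invalid cursor or pagination argument."""


class InvalidInputError(Exception):
    """Invalid SWHID, sort order, or similar client input."""


class QuerySyntaxError(Exception):
    """Malformed client query."""


class DataError(Exception):
    """Server-side data problem."""


# Errors caused by the client are not worth an alert.
CLIENT_ERRORS = (ObjectNotFoundError, PaginationError,
                 InvalidInputError, QuerySyntaxError)


def should_report_to_sentry(error: Exception) -> bool:
    """Report everything except known client errors (hypothetical helper)."""
    return not isinstance(error, CLIENT_ERRORS)


print(should_report_to_sentry(PaginationError("cursor is invalid")))
print(should_report_to_sentry(DataError("unexpected backend response")))
```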
Backends#
Archive
All the calls to swh-storage
Search
All the calls to swh-search
Middlewares#
LogMiddleware
Used to send statsd metrics
Local paginations#
eg: DirectoryEntry
Tests#
Add a new Scalar field inside a type#
Add the field in the schema
Add the cost associated. (Ideally this should be 0 for all scalars)
Find the resolver class from resolvers.resolvers.py (This step can be skipped in most of the cases by directly checking the factory dict).
Add a field either in the backend response or as a property in the resolver class.
If the field involves a new backend call or any extra computing, add it as a new type instead of a field (by following the steps below).
Add a new type field inside another type#
Add the type in the schema
Add a field along with arguments in the parent type connecting to the newly added type.
Add the cost associated. Multipliers can be used if needed.
Add the resolver class using the right base class and override the required function/s.
Add a backend function (if necessary) to fetch data
Connect the route in resolvers.resolvers.py
Bind the class in the resolver factory resolvers.resolver_factory.py.
You have to add the type in app.py, in case you have sub fields to resolve.
eg: MR to add an entry field in the Directory object. It is created as a new resolver object as it has a cost.
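The "bind the class in the resolver factory" step can be pictured as adding an entry to a lookup table. The sketch below is an assumption about the general shape, not the contents of resolvers.resolver_factory.py: the mapping names and `get_resolver_class` are hypothetical, the two classes are empty placeholders, and the keys reuse the parent-field style keys that appear elsewhere in this document.

```python
class ParentRevisionConnection:
    """Placeholder for a connection resolver class."""


class DirEntryInDirectoryNode:
    """Placeholder for a node resolver class."""


# Hypothetical factory tables: key -> resolver class.
NODE_MAPPING = {"directory-directoryentry": DirEntryInDirectoryNode}
CONNECTION_MAPPING = {"revision-parents": ParentRevisionConnection}


def get_resolver_class(mapping: dict, key: str) -> type:
    """Look up the class bound to a key, failing loudly if it is missing."""
    try:
        return mapping[key]
    except KeyError:
        raise LookupError(f"No resolver bound for {key!r}") from None


print(get_resolver_class(NODE_MAPPING, "directory-directoryentry").__name__)
print(get_resolver_class(CONNECTION_MAPPING, "revision-parents").__name__)
```

Adding a new type then amounts to writing the resolver class and registering it under a new key.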
Add a new entrypoint#
This is the same as adding a new type inside another type. The parent type will be the root (query) in this case.
Cost calculator#
Static and calculated by Ariadne.
This check is executed before running the query.
It may not be a good idea to use this to calculate credits, as this always assumes the maximum possible cost of a query.
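To see why a static check always assumes the maximum possible cost, consider a simplified model in which each field has a fixed cost and paginated fields multiply the cost of everything below them by the requested page size. This is an illustration only and not Ariadne's exact algorithm; `static_cost` and the field dictionaries are hypothetical.

```python
def static_cost(field: dict) -> int:
    """Worst-case cost of a field and its sub-fields (simplified model).

    A field is a dict with optional keys:
      complexity: fixed cost of resolving the field once (default 0)
      multiplier: page size requested through `first` (default 1)
      children:   list of sub-fields
    """
    children = sum(static_cost(child) for child in field.get("children", []))
    return (field.get("complexity", 0) + children) * field.get("multiplier", 1)


# origins(first: 2) { url visits(first: 10) { date } }, with made-up costs.
query = {
    "complexity": 1, "multiplier": 2,
    "children": [
        {"complexity": 0},                       # url: scalar, cost 0
        {"complexity": 1, "multiplier": 10,      # visits(first: 10)
         "children": [{"complexity": 0}]},       # date: scalar, cost 0
    ],
}
print(static_cost(query))
```

The result is computed from the query text alone, so it stays the same even if an origin has fewer than ten visits; that is the overestimate mentioned above.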
Client#
GraphiQL based and is returned from the GraphQL server itself.
Future works#
Indexers
Add an indexer backend
Metadata
Add a metadata backend. The major issue here is it is not well structured. It is not very helpful to return raw json.
Disable CORS
CORS is enabled for all the domains now. Limit only to legit clients
More backends
Graph backend, Provenance backend
More fields
More fields can be added with more backends. Eg: ‘firstSeen’ field in a content object can be added from provenance
Mutations
Write APIs
Dynamic query cost calculator and partial response
To calculate exact query cost. It is also possible to return part of the response depending on the cost.
Advanced rate limiter
Maybe as a different service. To support user level query quota and time based restrictions.
De-couple client and server
Client UI is returned by the same service. It is a good idea to make them separate. swh-client is a basic working copy. A simple independent client is available here.
Make fully Relay compliant
Missing startCursor and itemCursor in pagination. totalCount (not in relay spec) is missing from most of the connections.
Sentry transactions
Create a transaction per query and add all the objects.
Backend/performance improvements by writing new queries with joins
Write bigger backend queries to avoid multiple storage hits.
Add type to resolver arguments
User inputs are not typed now and are available in self.kwargs. Add types to each of the inputs
Remove local paginations
Move all the pagination to the backend
Cache
Ideally in the storage level.
Address FIXMEs
Most of them are related to local pagination
Make resolvers asynchronous
Could improve performance in case a query requests multiple types in a single request.