.. _mirror_operations:

Mirror Operations
=================

.. _mirror:

Description
-----------

A mirror is a full copy of the |swh| archive, operated independently from the
Software Heritage initiative. A minimal mirror consists of two parts:

- the graph storage (typically an instance of :ref:`swh.storage `), which
  contains the Merkle DAG structure of the archive, *except* the actual content
  of source code files (AKA blobs),

- the object storage (typically an instance of :ref:`swh.objstorage `), which
  contains all the blobs corresponding to archived source code files.

However, a usable mirror also needs to be accessible by others. As such, a
proper mirror should also make it possible to:

- navigate the archive copy using a Web browser and/or the Web API (typically
  using :ref:`the web application `),

- perform basic searches (typically using :ref:`the swh-search service ` with
  an Elasticsearch backend),

- retrieve data from the copy of the archive (typically using
  :ref:`the vault service `).

A mirror is initially populated and then kept up to date by consuming data
from the |swh| Kafka-based :ref:`journal ` and retrieving the blob objects
(file content) from the |swh| :ref:`object storage `.

.. note:: It is not required that a mirror be deployed using the |swh| software
   stack. Other technologies, including different storage methods, can be
   used. This documentation, however, focuses on mirror deployment using the
   |swh| software stack.

.. thumbnail:: ../images/mirror-architecture.svg

   General view of the |swh| mirroring architecture.

.. Note:: This general view is very simplified and does not show all the
   services involved in hosting and operating a mirror. See
   :ref:`planning-a-mirror` for a complete description of the requirements to
   host a mirror.

.. Note:: Hosting a complete mirror is a complex task, involving the deployment
   of dozens of interrelated (micro-)services. It should be planned and
   operated carefully, using state-of-the-art ops practices (cloud-based, or
   using container orchestration tools on an elastic execution platform like
   kubernetes_, `docker swarm`_, or using tools like Ansible_ or `Salt
   Stack`_).

.. Important:: It is **strongly** recommended to start with a simple `docker
   swarm`_ based deployment (this can be done on a single machine) as
   described in :ref:`mirror_deploy`.

Mirroring the Graph Storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The replication of the graph is based on a journal using Kafka_ as the event
streaming platform.

On the Software Heritage side, every addition made to the archive consists of
the addition of a :ref:`data-model` object. The new object is also serialized
as a msgpack_ bytestring, which is used as the value of a message added to a
Kafka topic dedicated to the object type.

The main Kafka topics for the |swh| :ref:`data-model` are:

- `swh.journal.objects.content`
- `swh.journal.objects.directory`
- `swh.journal.objects.extid`
- `swh.journal.objects.metadata_authority`
- `swh.journal.objects.metadata_fetcher`
- `swh.journal.objects.origin_visit_status`
- `swh.journal.objects.origin_visit`
- `swh.journal.objects.origin`
- `swh.journal.objects.raw_extrinsic_metadata`
- `swh.journal.objects.release`
- `swh.journal.objects.revision`
- `swh.journal.objects.skipped_content`
- `swh.journal.objects.snapshot`

In order to set up a mirror of the graph, one needs to deploy a stack capable
of retrieving all these topics and storing their content reliably. For example,
a Kafka cluster configured as a replica of the main Kafka broker hosted by
|swh| would do the job (albeit not in a very useful manner by itself).

A more useful mirror can be set up using the :ref:`storage ` component with
the help of the special service named `replayer` provided by the
:mod:`swh.storage.replay` module.

.. TODO: replace this previous link by a link to the 'swh storage replay'
   command once available, and ideally once
   https://github.com/sphinx-doc/sphinx/issues/880 is fixed

Mirroring the Object Storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

File contents (blobs) are *not* directly stored in messages of the
`swh.journal.objects.content` Kafka topic, which only contains metadata about
them, such as various kinds of cryptographic hashes. A separate component is in
charge of replicating blob objects from the archive and storing them in the
local object storage instance.

To do so, a dedicated `swh-journal` client should subscribe to the
`swh.journal.objects.content` topic to get the stream of blob object
identifiers, then retrieve the corresponding blobs from the main Software
Heritage object storage, and store them in the local object storage.

A reference implementation for this component is available in
:ref:`content replayer `.

Installation
------------

When using the |swh| software stack to deploy a mirror, a number of |swh|
software components must be installed (cf. architecture diagram above).

.. Note:: It is **not** recommended to try to deploy each |swh| service
   individually. You should rather start from the example docker-based
   deployment project :ref:`described here `.

A `docker swarm`_ based deployment solution is provided as a working example
of the mirror stack, see :ref:`mirror_deploy`.

It is strongly recommended to start from there before planning a
production-like deployment.

.. _Kafka: https://kafka.apache.org/
.. _msgpack: https://msgpack.org
.. _`docker swarm`: https://docs.docker.com/engine/swarm/
.. _Ansible: https://docs.ansible.com/ansible/latest/index.html
.. _kubernetes: https://kubernetes.io/fr/
.. _`Salt Stack`: https://saltproject.io/

You may want to read:

- :ref:`mirror_monitor` to learn how to monitor your mirror and how to report
  its health back to |swh|.
- :ref:`mirror_seaweedfs` for tips and explanations on how to use SeaweedFS_
  as objstorage backend for a mirror.
- :ref:`mirror_onboard` for the |swh| side view of adding a new mirror.

.. _SeaweedFS: https://github.com/seaweedfs/seaweedfs/

.. toctree::
   :hidden:

   planning
   deploy
   seaweedfs
   takedown-notices
   onboard
   monitor
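
The blob replication loop described in "Mirroring the Object Storage" can be
sketched as follows. This is only an illustration under stated assumptions,
not the reference content replayer: the journal topic and both object storages
are replaced by in-memory stand-ins, the names ``content_messages`` and
``replicate_contents`` are hypothetical, and blobs are assumed to be keyed by
their SHA1 hex digest.

```python
import hashlib
from typing import Dict, Iterable, Iterator


def content_messages(blobs: Iterable[bytes]) -> Iterator[dict]:
    """Hypothetical stand-in for the content topic.

    Yields metadata only (a hash and a length), never the blob itself.
    """
    for data in blobs:
        yield {"sha1": hashlib.sha1(data).hexdigest(), "length": len(data)}


def replicate_contents(
    messages: Iterable[dict],
    source: Dict[str, bytes],
    destination: Dict[str, bytes],
) -> int:
    """Copy each blob referenced by ``messages`` from ``source`` to ``destination``.

    Returns the number of blobs newly copied. Blobs already present in
    ``destination`` are skipped, so replaying the same messages is harmless.
    """
    copied = 0
    for message in messages:
        obj_id = message["sha1"]
        if obj_id in destination:
            continue  # already mirrored locally
        data = source[obj_id]  # fetch from the (simulated) upstream storage
        if hashlib.sha1(data).hexdigest() != obj_id:
            raise ValueError(f"hash mismatch for {obj_id}")
        destination[obj_id] = data  # store in the (simulated) local storage
        copied += 1
    return copied


# Toy data: two blobs held by the simulated upstream object storage.
blobs = [b"print('hello')\n", b"int main(void) { return 0; }\n"]
upstream = {hashlib.sha1(b).hexdigest(): b for b in blobs}
local: Dict[str, bytes] = {}

copied = replicate_contents(content_messages(blobs), upstream, local)
```

After the call, ``local`` holds the same two blobs as ``upstream``. Checking
the hash before storing is a design choice of this sketch; it guards against a
blob that was corrupted in transit.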