Vespa is Yahoo’s fully autonomous and self-sufficient big data processing and serving engine. It aims to serve the results of queries over huge amounts of data in real time — for example, search results or recommendations presented to a user.
Yahoo – or Oath – recently made Vespa open source on GitHub. For now, the documentation focuses on the user experience; open sourcing the code should attract new contributors, particularly to the documentation, much as it did for Apache Hadoop more than 10 years ago. Quoting Jon Bratseth, Distinguished Architect for Vespa: “by releasing Vespa, we are making it easy for anyone to build applications that can compute responses to user requests, over large datasets, at real time and at internet scale”.
This article compiles some of my notes taken on a journey to better understand the nature of Vespa as well as my feedback and opinion after reading the documentation, trying a few examples and roaming the source code. I make sure to explain any technical vocabulary specific to Vespa, so if you see a word you don’t know, odds are it has a paragraph dedicated to it later on. Let’s see how Vespa lays the groundwork for building powerful and scalable search and recommendation applications.
Vespa lets you:
- Select content using SQL-like queries and unstructured search
- Organize all matches to generate data-driven pages
- Rank matches by handwritten or machine-learned relevance models
- Write persistent data in real time thousands of times per second per node
- Grow, shrink and reconfigure clusters while serving and writing
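As a taste of the first item, selection is expressed in Vespa’s SQL-like query language, YQL. A hedged example — the `music` source and its `artist` and `year` fields are hypothetical:

```
select * from sources music where artist contains "Flume" and year > 2015 limit 10;
```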
Vespa is built on a core concept familiar to Hadoop users: deploy and execute code as close to the data as possible. Additionally, it provides access to processed data obtained from outside sources if required. Vespa applications can write, look up, and query documents stored in the engine. They can also execute data dumps and handle custom network requests, such as exposing custom HTTP REST services.
Vespa divides the nodes that are part of its system into three categories:
- The container clusters: host the system’s applications and most of its middleware.
- The content clusters: store the system’s data and do most of its processing.
- The administration and configuration cluster: rule over all the other components.
Vespa applications, built from component chains, are developed and hosted in JDisc containers. JDisc containers are the basis for the search container and document processing. These applications make use of Vespa’s elastic storage system.
The entire system is administered by a comprehensive Cloud Configuration System, a simple tool that grants Vespa users control over resource allocation.
Vespa seems to have a complete and utter lack of security, both external and internal.
While Vespa’s capabilities are reported to be spectacular, how it accomplishes such speed and scalability remains unclear. After going through the project’s documentation, I have a good idea of how to use Vespa and what to use it for, but I cannot clearly visualize how it functions in the background. The most I’ve discovered is that the Vespa engine performs elaborate indexing of data and metadata in order to serve queries with very low latency.
Write: Add, replace, remove, or edit documents.
Lookup: Find a document or a subset of documents by their IDs.
Query: Select documents whose fields match a boolean combination of conditions. The selection can be sorted, or ranked using a given ranking expression (math, business logic, or a machine-learned ranking model). The selection can also be grouped by field values, which can then be aggregated. To avoid needlessly shipping data back to the container cluster, any aggregation is done in a distributed fashion.
Data dump: Get any raw data that matches a given criterion.
Custom network request: Handled by application components deployed on container clusters.
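The distributed aggregation described under Query above can be sketched in a few lines of Python: each content node computes a partial aggregate over its own documents, and the container merges those partials instead of shipping raw documents around. All names and data here are made up for illustration.

```python
from collections import Counter

# Hypothetical documents sharded across two content nodes.
NODE_SHARDS = [
    [{"genre": "rock", "plays": 3}, {"genre": "jazz", "plays": 1}],
    [{"genre": "rock", "plays": 2}, {"genre": "pop", "plays": 5}],
]

def partial_aggregate(docs):
    """Runs on each content node: group by 'genre', sum 'plays'."""
    partial = Counter()
    for doc in docs:
        partial[doc["genre"]] += doc["plays"]
    return partial

def merge(partials):
    """Runs on the container node: merge the per-node partial results."""
    total = Counter()
    for p in partials:
        total.update(p)
    return dict(total)

result = merge(partial_aggregate(shard) for shard in NODE_SHARDS)
# → {"rock": 5, "jazz": 1, "pop": 5}
```

Only the small per-node partials cross the network, which is the point of doing aggregation in a distributed fashion.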
Vespa divides its system into three cluster types: container clusters, content clusters, and an administration/configuration cluster.
Container clusters are stateless. They host the system’s application components and provide a network layer for handling and issuing remote requests, for which HTTP is readily available but other protocols can be plugged in.
Content clusters reliably store data by replicating it over multiple nodes; the number of copies is specified by the application. The clusters maintain distributed indices of data for searches and selects. These clusters can be grown or shrunk while online by editing the application package.
Lookups are routed directly to the correct node; queries, on the other hand, fan out over the subset of nodes that contain the relevant documents. These queries use distributed algorithms with several round trips between container and content nodes, which keeps latency low.
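A minimal sketch of why a lookup needs only one node: the document ID deterministically selects its home node. This assumes a simple hash-modulo placement; Vespa’s real distribution algorithm is bucket-based and more elaborate.

```python
import hashlib

CONTENT_NODES = ["node-0", "node-1", "node-2"]  # hypothetical cluster

def node_for(doc_id: str) -> str:
    # Hash the document ID and map it onto a node. Because the mapping
    # is deterministic, a lookup can go straight to the right node.
    digest = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return CONTENT_NODES[digest % len(CONTENT_NODES)]

# A lookup contacts only this one node; a query would fan out instead.
target = node_for("id:music:song::purple-rain")
```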
There is only one such cluster, because it controls the other clusters. It derives the low-level configuration required by each node of every other cluster from the high-level configuration given by the application developer.
Application packages can be changed, redeployed, or inspected through an HTTP REST API or through CLI commands. The process runs over ZooKeeper, to ensure that configuration changes are atomic and consistent while avoiding any single point of failure.
On the container clusters, application components and plug-ins are hosted in JDisc containers (Java Data Intensive Serving Container), also known as middleware containers (MWC). These are the basis of the Search Container and the Document Processing Container: search containers process queries and return results, while docproc containers process the documents fed to them and return them. This structure allows components to process or modify all queries, results, and documents.
JDisc containers provide a development and hosting environment for processing components, through an OSGi based component model which allows deployment and un-deployment of components to a cluster of running instances. They also provide a general model of Documents, Queries, Results, and the components processing them. Finally, they provide a composition mechanism – chains – for building applications from components and managing the execution order dependencies. This allows components to be developed by different teams.
Vespa uses the following application components: searchers, document processors, handlers, and renderers. Chained components are executed serially and each provides some service or transform.
Each component declares which services should run before it in the chain (its dependencies) and which should run after it (its dependants). These ordering constraints define the execution order.
Fundamentally, a chain is simply a set of application components with dependencies.
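Resolving such ordering constraints amounts to a topological sort. A minimal sketch using Python’s standard library (requires Python 3.9+; the component names are hypothetical):

```python
from graphlib import TopologicalSorter

# Hypothetical searcher chain: each component maps to the set of
# components that must run before it.
dependencies = {
    "blending": {"federation"},
    "federation": {"query-rewrite"},
    "query-rewrite": set(),
}

# static_order() yields components so that every component's
# predecessors appear before it.
order = list(TopologicalSorter(dependencies).static_order())
# → ["query-rewrite", "federation", "blending"]
```

Because only the constraints are declared, components developed by different teams can be combined and the framework works out a valid execution order.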
Vespa stores application data in its content clusters. Which nodes to use for storage is specified in each application’s configuration. Applications can grow or shrink their hardware while serving queries and accepting writes as normal.
When such a change occurs, data is automatically redistributed in the background. Vespa applies minimal data movement required for even data distribution. No restarts are needed: all one needs to do is edit the hardware list in the application’s configuration and redeploy the app.
The same mechanism is used to adapt to faulty nodes or disks in the content clusters.
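One way to picture “minimal data movement” is rendezvous (highest-random-weight) hashing: each bucket is owned by whichever node wins a per-bucket hash, so adding a node only moves the buckets the newcomer wins. This is an illustrative stand-in, not Vespa’s actual distribution algorithm.

```python
import hashlib

def owner(bucket: int, nodes: list) -> str:
    # Each node gets a pseudo-random weight per bucket; the highest
    # weight wins ownership of that bucket.
    def weight(node):
        return int(hashlib.md5(f"{bucket}:{node}".encode()).hexdigest(), 16)
    return max(nodes, key=weight)

buckets = range(1000)
before = {b: owner(b, ["n0", "n1", "n2"]) for b in buckets}
after = {b: owner(b, ["n0", "n1", "n2", "n3"]) for b in buckets}

# Only buckets now won by the new node "n3" change owner;
# roughly a quarter of them move, the rest stay put.
moved = sum(1 for b in buckets if before[b] != after[b])
```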
To date, Vespa is not optimised to run across multiple datacenters: its distribution algorithm is not suited to global distribution, as it makes no attempt to minimize bandwidth usage between data centers, nor can it seamlessly tolerate datacenter-wide outages. The current workaround is to run an instance of Vespa in each data center and have applications write to all instances. Vespa’s development team acknowledges that this is not ideal and is working on a solution.
Storage nodes do the actual storing. They provide a Service Provider Interface (SPI) that abstracts how documents are stored in the elastic system. This interface is implemented by proton nodes, which make up Vespa’s search core.
Whenever a Document is written to or read from the system it goes through a distributor. A distributor’s main role is to decide which bucket a document is stored in.
In order to handle a large number of documents, Vespa groups them in buckets. These buckets are split and merged as they grow and shrink over time. The SPI links the elastic bucket system and the actual document storage. Buckets contain tombstones of recently removed documents, which are removed after a configurable amount of time.
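A toy sketch of bucket splitting, assuming a size threshold and one extra hash bit per split (Vespa’s real bucket management is far more refined):

```python
import hashlib

MAX_DOCS = 4  # hypothetical split threshold

def bucket_bit(doc_id: str, depth: int) -> int:
    """Stable bit of the document ID's hash, used to pick a sub-bucket."""
    digest = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return (digest >> depth) & 1

def maybe_split(docs: dict, depth: int = 0) -> list:
    """Split a bucket in two when it outgrows MAX_DOCS; otherwise keep it."""
    if len(docs) <= MAX_DOCS:
        return [docs]
    left = {i: d for i, d in docs.items() if bucket_bit(i, depth) == 0}
    right = {i: d for i, d in docs.items() if bucket_bit(i, depth) == 1}
    return [left, right]

bucket = {f"id:test:doc::{n}": {"n": n} for n in range(6)}
parts = maybe_split(bucket)  # 6 > MAX_DOCS, so the bucket splits in two
```

Merging is the inverse operation; in the real system it also clears out expired tombstones.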
At first sight, Vespa’s Cloud Configuration System (CCS) does not seem to be complicated.
We can divide configuration into three categories:
- Node system level configuration: set system time, user privileges, …
- Package management: ensure the correct software is installed on all nodes.
- Cloud configuration: start the configured set of processes on each node with their configured startup parameters and provide dynamic configuration to the modules run by these services. Configuration here is any data which cannot be fixed at compile time and is static most of the time.
Vespa relies almost exclusively on cloud configuration: all cloud nodes can run the same software packages, since differences in behavior are managed entirely through the CCS. The complexity of node variation is thus handled in one place rather than spread across multiple systems.
Configuring the Vespa system is divided into assembly and delivery.
Configuration assembly is based on the application package, which is sent to the configuration server. A configuration model of the distributed system is assembled, accepting higher-level abstractions as input and automatically deriving detailed configurations with the correct interdependencies. Once validated, this configuration can be activated, which updates all relevant application configurations.
Configuration delivery queries the configuration model previously assembled to obtain the concrete configurations of the components of the system. Components subscribe to certain configurations and query the configuration server API before updating. Subscriptions are consecutive long-timeout gets, which work in a manner comparable to push notifications.
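The long-timeout get pattern can be simulated with a condition variable: a subscriber asks for anything newer than the configuration generation it already knows, and the server answers immediately if newer config exists, or blocks until a new configuration is activated (or the timeout expires). A toy model, not Vespa’s API:

```python
import queue
import threading
import time

class ConfigServer:
    """Toy config server supporting long-timeout gets."""
    def __init__(self):
        self.generation = 1
        self.payload = {"threads": 4}
        self.changed = threading.Condition()

    def get(self, known_generation, timeout=5.0):
        # Block only if the subscriber already has the latest generation.
        with self.changed:
            if self.generation <= known_generation:
                self.changed.wait(timeout)
            return self.generation, dict(self.payload)

    def activate(self, payload):
        # Deploying new config bumps the generation and wakes subscribers.
        with self.changed:
            self.generation += 1
            self.payload = payload
            self.changed.notify_all()

server = ConfigServer()
results = queue.Queue()
# Subscriber issues a long-timeout get for anything newer than gen 1.
threading.Thread(target=lambda: results.put(server.get(1))).start()
time.sleep(0.1)
server.activate({"threads": 8})  # deploy new config
generation, config = results.get(timeout=5)
# → generation 2, {"threads": 8}
```

The subscriber sees updates almost as fast as a push, while the server only ever answers requests, which keeps the protocol simple.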
Each node has a configuration proxy so that application components do not interact directly and redundantly with the configuration cluster.
CCS makes several assumptions about the nodes using it:
- All nodes have the software packages needed to run the configuration system and any services which will be configured to run on the node. This usually means that all the nodes have the same software, although this is not a requirement.
- All nodes have set VESPA_CONFIGSERVERS, which defines the configuration servers to query.
- All nodes know their fully qualified domain names.
It is unclear from the documentation how Vespa manages node resources when multiple applications attempt to use them. It seems that a “priority” system is in place, but I couldn’t find details on how it works or how to use it effectively.
There is virtually no security included in Vespa.
It is crucial that the Vespa hosts be isolated. They must not be exposed directly to the Internet. Whether you decide to solve this problem with a network proxy or with another solution is up to you.
All Vespa hosts must be able to communicate with each other. However, none of the internal protocols are authenticated. Therefore, any external connection must come from a set of trusted hosts. This can be accomplished with iptables, or AWS Security Groups, etc.
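For example, with iptables one might admit only a trusted subnet to Vespa’s default container port; the subnet and port here are assumptions to adapt to your setup.

```shell
# Hypothetical: accept traffic to the container port 8080 only from
# the trusted 10.0.0.0/24 subnet, and drop everything else.
iptables -A INPUT -s 10.0.0.0/24 -p tcp --dport 8080 -j ACCEPT
iptables -A INPUT -p tcp --dport 8080 -j DROP
```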
To top it all off, inter-node communication is not encrypted.
By default, application containers allow unauthenticated writes and reads to and from the Vespa installation. Be sure to lock this down in a production environment.
All Vespa processes run under the Linux user given by $VESPA_USER and store their data under $VESPA_HOME. You should ensure the files and directories under $VESPA_HOME are not accessible by other users if you store sensitive data in your application.
Vespa does not have support for encryption of on-disk document stores or indices.
All of this makes me wonder: was Yahoo using Vespa in 2013? Either way, it looks like Vespa was built to run in an already secure production environment. Perhaps the engine’s newfound open source community will provide the security features it desperately needs.
In my humble opinion, Vespa is huge. It seems to do a lot and to do it well. That being said, it isn’t ready to compete with well-established big data systems like Hadoop just yet. Vespa gives off a plug-n-play vibe that I can’t ignore. It certainly seems easy to use, but tweaking it for specific goals may be complicated.
Its documentation, in its current state, is inconsistent in quality and does too little to explain how the system works as compared to how it should be used. Maybe this is the direction big data will take in the years to come: powerful tools meant to be used out of the box. If not, then Vespa may be a tad off the mark.