CodaLab – Data Science competitions
Dec 17, 2018
CodaLab Competition is a platform for code execution in the field of Data Science. It is a web interface on which a user can submit code or results and compare themselves to others. Let’s see how it works and how to install CodaLab On-Premise.
Competition is anchored in our personal and professional lives. Its goal is not necessarily to be better than others; the main goal is to keep excelling while having fun. In the world of Big Data, and in computing more generally, participating in competitions has several advantages. Competing with others helps build skills on new technologies and, by confronting us with others, lets us evaluate our real abilities. Organizing competitions internally can revitalize a group and motivate the members of a team. It encourages a healthy competitive spirit and pushes, for instance, Data Scientists to write ever more efficient code.
In this regard, a client asked us to look at the different tools available for organizing data science competitions internally. We selected CodaLab and CodaLab Competition. CodaLab allows code execution and sharing within a team. CodaLab Competition allows organizing competitions on top of a CodaLab infrastructure.
CodaLab was created in 2013 as a joint venture between Microsoft and Stanford University. Originally, the vision was to create an ecosystem for conducting computational research in a more efficient, reproducible, and collaborative manner, combining worksheets and competitions. Worksheets capture complex research pipelines in a reproducible way and create “executable papers”. With this Open Source web platform, researchers and developers can collaborate to advance research, mainly in areas where machine learning and advanced computing are used. Indeed, with CodaLab it is easy to share one's work with any community, which makes collaboration more effective. CodaLab essentially makes it possible to solve many common problems in the field of data-driven research. Nevertheless, it can also solve more complex problems when the solution can be provided in the form of a zip archive.
Since 2016, CodaLab has offered the possibility of organizing online competitions directly on its servers. CodaLab Competition mainly hosts data science competitions, but it is not limited to this area of application. To participate in a competition, simply register and propose a solution. The solution can be a submission of results or of code. The simplest competitions only require the submission of results, which are compared to a solution (or key) by a scoring program. Results submission challenges are less expensive to compute than code submission challenges, since scoring is only a comparison of results. Code submission allows performance testing by running the submitted code in the same environment for all participants. In 2014, ChaLearn, which organizes challenges in the Machine Learning area to stimulate research, partnered with CodaLab for the joint development of CodaLab Competition. A particularly exciting new feature of CodaLab Competition is that organizers can now connect their own computing agents to CodaLab’s backend to redirect code submissions. This feature is interesting because it allows competitions to be organized internally on an architecture specific to the company. Certain limitations, for example regarding data security, can thus be overcome.
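To make the results-submission case concrete, here is a minimal sketch of what a scoring program could do: compare submitted predictions with the key and report a score. The file format (one label per line), the function names and the choice of accuracy as a metric are assumptions made for illustration; a real competition defines its own format and metric.

```python
from pathlib import Path
import tempfile


def read_labels(path):
    """Read one label per line, ignoring blank lines (assumed file format)."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def accuracy(key_path, submission_path):
    """Return the fraction of submitted labels that match the key."""
    key = read_labels(key_path)
    submitted = read_labels(submission_path)
    if len(key) != len(submitted):
        raise ValueError("submission and key must have the same number of lines")
    if not key:
        return 0.0
    matches = sum(1 for expected, got in zip(key, submitted) if expected == got)
    return matches / len(key)


if __name__ == "__main__":
    # Tiny self-contained demo with made-up data.
    with tempfile.TemporaryDirectory() as tmp:
        key_file = Path(tmp, "key.txt")
        submission_file = Path(tmp, "submission.txt")
        key_file.write_text("cat\ndog\ncat\nbird\n")
        submission_file.write_text("cat\ndog\ndog\nbird\n")
        print(f"accuracy: {accuracy(key_file, submission_file):.2f}")
```

Because the work is a single pass over two small files, this kind of scoring stays cheap compared to running participant code.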
The architecture is as follows:
- The CodaLab server that mainly allows sharing via a web interface
- The CodaLab Competition service, which runs on top of the CodaLab server and makes it possible to set up competitions.
It is therefore necessary to first have a functional CodaLab server. Let’s now focus on the architecture and installation of the latter.
RabbitMQ is used as a job message broker.
It provides the queue to which long-running tasks are sent, such as:
- Create competitions
- Evaluate submissions
- Send mails
- Re-execute all submissions
- Schedule tasks
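The idea behind such a job queue can be sketched without a broker. The example below is an in-process stand-in built on Python's standard library, not CodaLab's code and not a RabbitMQ client: the web side only enqueues a job and returns, while a background worker performs the long task. The job names and the `process` helper are hypothetical.

```python
import queue
import threading


def run_worker(jobs, results):
    """Consume jobs until a None sentinel arrives, recording each outcome."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        kind, payload = job
        # A real worker would score a submission or send a mail here.
        results.append(f"{kind} done for {payload}")
        jobs.task_done()


def process(job_list):
    """Enqueue jobs, let one background worker drain them, return the results."""
    jobs = queue.Queue()
    results = []
    worker = threading.Thread(target=run_worker, args=(jobs, results))
    worker.start()
    for job in job_list:
        jobs.put(job)  # the caller is free again right after this call
    jobs.put(None)     # sentinel: no more work
    worker.join()
    return results


if __name__ == "__main__":
    for line in process([("evaluate", "submission-1"), ("mail", "alice")]):
        print(line)
```

With a real broker, the queue lives in a separate process, so jobs are not tied to the lifetime of the web server process.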
Nginx is an HTTP server that can manage web requests. We can use it to cache static pages and manage a large influx of traffic if needed.
How does CodaLab use Docker?
The code submitted on the CodaLab platform is run in a Docker container. This environment can be reproduced identically on a local computer by downloading the corresponding image. The default CodaLab environment contains a large number of pre-installed programs, such as Python. It is possible to download or customize the default docker-codalab-legacy-worker image from the Docker Hub by searching for codalab/codalab-legacy.
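As a rough sketch of how a submission could be reproduced locally, the helper below builds a `docker run` command line for that image. It only assembles the arguments and does not start Docker. The `/submission` mount point, the `run.py` entry point and the helper name are assumptions for illustration, not part of CodaLab.

```python
from pathlib import Path

# Image name as given in the text; append a tag if a specific version is needed.
IMAGE = "codalab/codalab-legacy"


def docker_run_command(submission_dir, command, image=IMAGE):
    """Build a `docker run` argument list that mounts the submission read-only."""
    host_dir = Path(submission_dir).resolve()  # bind mounts need an absolute path
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:/submission:ro",  # /submission is a hypothetical mount point
        "-w", "/submission",
        image,
        *command,
    ]


if __name__ == "__main__":
    # Prints the command instead of running it, so Docker is not required here.
    print(" ".join(docker_run_command("my_submission", ["python", "run.py"])))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where Docker is installed and the image has been pulled.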
For the installation, a wiki is available. It shows step by step how to set up CodaLab on an Ubuntu machine. However, after several failed installation attempts, we provide here an installation guide for CentOS 7 summarizing the main actions to perform. First, download the source code hosted on GitHub:
git clone https://github.com/codalab/codalab-worksheets
git clone https://github.com/codalab/codalab-cli
In the following, the environment variable $HOME refers to the directory in which the codalab-worksheets and codalab-cli Git repositories are downloaded. The configuration files will be stored in $CODALAB_HOME, which is ~/.codalab by default. Specific packages must be installed beforehand.
Python and virtualEnv dependencies
yum install -y python-virtualenv
yum install -y epel-release
yum install npm
yum install -y gcc make
wget http://repo.mysql.com/mysql-community-release-el7-5.noarch.rpm
sudo rpm -ivh mysql-community-release-el7-5.noarch.rpm
yum update
yum -y install mysql-server
yum install -y python-devel mysql-devel
wget https://download.docker.com/linux/centos/7/x86_64/stable/Packages/docker-ce-selinux-17.03.0.ce-1.el7.centos.noarch.rpm
yum install -y docker-ce-selinux-17.03.0.ce-1.el7.centos.noarch.rpm
wget https://download.docker.com/linux/centos/7/x86_64/stable/Packages/docker-ce-17.03.0.ce-1.el7.centos.x86_64.rpm
yum install -y docker-ce-17.03.0.ce-1.el7.centos.x86_64.rpm
It is important to have a codalab user because some commands must be executed as codalab and not as root.
useradd codalab
usermod -aG wheel codalab
Once all the necessary prerequisites are installed, we can start the installation. Be careful, you have to run the following commands as
chown -R codalab: "codalab-cli/" "codalab-worksheets/"
cd "$HOME/codalab-worksheets" && ./setup.sh
cd "$HOME/codalab-cli" && ./setup.sh server
Once the installation is complete, the database must be configured and secured. A codalab user and a codalab_bundles database are created, and we will link them to CodaLab.
sudo mysql -u root
CREATE USER "codalab"@"localhost" IDENTIFIED BY "<passwd>";
CREATE DATABASE codalab_bundles;
GRANT ALL ON codalab_bundles.* TO "codalab"@"localhost";
CodaLab must then be connected to the database.
cd "$HOME/codalab-cli" && codalab/bin/cl config server/engine_url mysql://codalab:<passwd>@localhost:3306/codalab_bundles
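One practical pitfall with such a connection URL is a password containing characters like `@` or `/`, which would be misread as URL delimiters. The sketch below, assuming the value is parsed as a standard database URL, percent-encodes the credentials before building the string. The `engine_url` helper is hypothetical and not part of CodaLab.

```python
from urllib.parse import quote_plus


def engine_url(user, password, host="localhost", port=3306, database="codalab_bundles"):
    """Build a mysql:// URL, percent-encoding characters such as '@' or '/'."""
    return f"mysql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"


if __name__ == "__main__":
    # Made-up password, chosen to contain URL delimiters.
    print(engine_url("codalab", "p@ss/word"))
```

The printed value can then be used as the argument of the `cl config server/engine_url` command shown above.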
To offer a registration service, you must configure the email service. It is used to validate the registration of new users by email, and it also lets them receive emails from the CodaLab server. The configuration consists of registering an email account (mail server host, email address, password). It is not possible to configure the sending of mail through an SMTP server specific to the company. Several workarounds are available. For example, we can parse the logs and automate the sending of mails through our SMTP server when new registrations occur. We can also set up a Watchdog that triggers an email for each registration event. Nevertheless, implementing these solutions adds extra work. The standard configuration of the email account via CodaLab is as follows.
$HOME/codalab-cli/codalab/bin/cl config email/host <host>
$HOME/codalab-cli/codalab/bin/cl config email/user <username>
$HOME/codalab-cli/codalab/bin/cl config email/password <password>
$HOME/codalab-cli/codalab/bin/cl config admin-email <email>
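The log-parsing workaround mentioned above could look roughly like the following sketch. The log line format matched by `REGISTRATION` is purely hypothetical, since the actual CodaLab log format would have to be checked first, and the message text is made up. The sketch only builds the messages and does not send anything.

```python
import re
from email.message import EmailMessage

# Hypothetical log line format; the real CodaLab log format must be checked.
REGISTRATION = re.compile(r"new user registered: (?P<email>\S+@\S+)")


def registration_mails(log_lines, sender):
    """Yield one EmailMessage per registration event found in the log lines."""
    for line in log_lines:
        match = REGISTRATION.search(line)
        if match is None:
            continue
        message = EmailMessage()
        message["From"] = sender
        message["To"] = match.group("email")
        message["Subject"] = "CodaLab registration"
        message.set_content("A new account was registered with this address.")
        yield message


if __name__ == "__main__":
    sample = [
        "2018-12-01 INFO new user registered: alice@example.com",
        "2018-12-01 INFO login ok",
    ]
    for mail in registration_mails(sample, "noreply@example.com"):
        print(mail["To"], "|", mail["Subject"])
```

Each message could then be handed to the company SMTP server with `smtplib.SMTP(host).send_message(message)` from the standard library.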
Installation and execution of Nginx
Nginx is an HTTP server that will manage all our web requests. First, we have to install it:
yum install -y nginx
Once installed, it must be configured to work with CodaLab:
cd "$HOME/codalab-worksheets/codalab" && ./manage config_gen
This generates an Nginx configuration file. Add the line include $HOME/codalab-worksheets/codalab/config/generated/nginx in the HTTP block of the main Nginx configuration file.
Once all these actions have been carried out, we can launch the various services CodaLab needs to run:
- Start the website server
cd "/opt/codalab-worksheets/codalab"
./manage runserver 127.0.0.1:2700
- Start the API service
cd "/opt/codalab-cli"
codalab/bin/cl server
- Start the bundle manager
cd "/opt/codalab-cli"
codalab/bin/cl bundle-manager
- Start the worker
cd "/opt/codalab-cli/worker/codalabworker"
./worker.sh --server http://localhost:2900 --password /home/codalab/root.password
Advantages
- When organizing competitions internally, the evaluation scripts are run and the results are collected fully automatically.
- Participants can easily test their output formats (for example, on test data) without needing any help.
- It is relatively easy to define the start and end dates of the different competitions.
- CodaLab rankings may include several different scores and may be anonymous if desired.
Disadvantages
- CodaLab, when integrated with our own agents, is not yet very stable, and we do not really have control over the installation: the setup scripts we launch take care of the whole installation.
- The documentation is neither detailed nor explicit enough.
- It is not possible to use our own SMTP server for sending emails. One workaround would be to use a Watchdog or to parse the logs and send emails via our SMTP server.
- The Git project is not really up to date.
CodaLab Competition is a great solution for organizing competitions internally. However, it requires a functional CodaLab server, whose installation is not yet very smooth. It does not always work well and the project’s Git repository is not really up to date: we had to navigate all the branches to find the right information and the right scripts. In conclusion, after consultation with the customer’s teams, the decision was made to wait until the technology matures. A compatibility test with a container orchestration solution such as Kubernetes is on the roadmap, and it may give interesting results.