CodaLab – Data Science competitions

CodaLab Competition is a platform for running code in the field of Data Science. It provides a web interface through which users can submit code or results and compare themselves with others. Let’s see how it works and how to install CodaLab on-premise.

Competition is anchored in our personal and professional lives. Its goal is not necessarily to be better than others; rather, it is to keep excelling while having fun. In Big Data, and in computing more generally, taking part in competitions has several advantages. Competing with others helps build skills in new technologies and gives participants a realistic measure of their own abilities. Organizing competitions internally can also revitalize a group and motivate team members. It encourages a healthy competitive spirit and pushes Data Scientists, for instance, to write ever more efficient code.

With this in mind, a client asked us to survey the tools available for organizing data science competitions internally. We selected CodaLab and CodaLab Competition. CodaLab allows code to be executed and shared within a team. CodaLab Competition allows competitions to be organized on top of a CodaLab infrastructure.

CodaLab

CodaLab was created in 2013 as a joint venture between Microsoft and Stanford University. The original vision was to create an ecosystem for conducting computational research in a more efficient, reproducible, and collaborative manner by combining worksheets and competitions. Worksheets capture complex research pipelines in a reproducible way and create “executable papers”. With this open source web platform, researchers and developers can collaborate to advance research, mainly in areas where machine learning and advanced computing are used. CodaLab makes it easy to share one’s work with any community, which makes collaboration more effective. CodaLab essentially offers a way to solve many common problems in the field of data-driven research. It can also handle more complex problems when the solution can be provided in the form of a zip archive.

CodaLab Competition

Since 2016, CodaLab has offered the possibility of organizing online competitions directly on its servers. CodaLab Competition mainly hosts data science competitions, but it is not limited to this area of application. To participate in a competition, simply register and propose a solution. The solution can be a submission of results or of code.

The simplest competitions only require the submission of results, which are compared to a solution (or key) by a scoring program. Result-submission challenges are cheaper to compute than code-submission challenges, since scoring is only a comparison of results. Code submission allows performance testing by running the submitted code in the same environment for all participants.

In 2014, ChaLearn, which organizes challenges in the Machine Learning area to stimulate research, partnered with CodaLab. The goal was the joint development of CodaLab Competition. A particularly exciting new feature of CodaLab Competition is that organizers can now connect their own computing agents to CodaLab’s backend to redirect code submissions. This feature is interesting because it allows competitions to be organized internally, on an architecture specific to the company. Certain limitations, for example regarding data security, can be overcome this way.
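To make the results-submission case concrete, here is a minimal sketch of what a scoring program does: it compares submitted labels against the key and returns a score. The function names, the one-label-per-line format and the accuracy metric are assumptions chosen for illustration; they are not CodaLab's actual scoring interface.

```python
def parse_labels(text):
    """Parse one label per non-empty line (hypothetical file format)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def score_submission(predictions, solution):
    """Return the fraction of predictions that match the key."""
    if not solution:
        raise ValueError("the key is empty")
    if len(predictions) != len(solution):
        raise ValueError("submission and key must have the same number of rows")
    correct = sum(1 for p, s in zip(predictions, solution) if p == s)
    return correct / len(solution)


# Illustrative data only: a four-row key and a submission with one mistake.
key = parse_labels("cat\ndog\ndog\ncat\n")
submitted = parse_labels("cat\ndog\ncat\ncat\n")
accuracy = score_submission(submitted, key)
print(f"accuracy: {accuracy:.2f}")
```

Because scoring of this kind only reads two small files, it is much cheaper than running a participant's code.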

The architecture is as follows:

  • The CodaLab server, which mainly allows sharing via a web interface.
  • The CodaLab Competition service, which sits on top of the CodaLab server and makes it possible to set up competitions.

A functional CodaLab server is therefore required first. Let’s now focus on the architecture and installation of the latter.

CodaLab architecture

Docker

CodaLab uses Docker to manage local development and the deployment of environments because it offers a higher level of reproducibility. Previously, it took hours to install each piece of CodaLab.

Django

Django is the most important part of CodaLab Competition. Django is used to interact with the MySQL database, migrate the state of the database, and perform asynchronous tasks.

MySQL

MySQL is the database used by CodaLab.

RabbitMQ

RabbitMQ is used as a job message broker.

Celery

Celery is the task queue on which long-running tasks are executed, such as:

  • Create competitions
  • Evaluate submissions
  • Send emails
  • Re-execute all submissions
  • Schedule tasks
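The broker-and-worker pattern behind this setup can be sketched with the Python standard library alone. This is a stand-in under stated assumptions, not Celery's or RabbitMQ's API: a `queue.Queue` plays the role of the broker, a thread plays the role of the worker, and `evaluate_submission` is a hypothetical placeholder for a long-running job.

```python
import queue
import threading


def evaluate_submission(submission_id):
    """Hypothetical placeholder for a long-running job."""
    return f"submission {submission_id} evaluated"


def worker(jobs, results):
    """Consume jobs until the None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        func, arg = job
        results.append(func(arg))
        jobs.task_done()


def run_jobs(job_list):
    """Queue the jobs, let a single worker process them, return the results."""
    jobs = queue.Queue()      # stands in for the message broker
    results = []
    thread = threading.Thread(target=worker, args=(jobs, results))
    thread.start()
    for job in job_list:
        jobs.put(job)         # the web process returns immediately after queuing
    jobs.put(None)            # sentinel: no more work
    jobs.join()
    thread.join()
    return results


results = run_jobs([(evaluate_submission, 1), (evaluate_submission, 2)])
print(results)
```

The point of the design is that the web request only enqueues the job; the slow work happens elsewhere, so the interface stays responsive.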

Nginx

Nginx is an HTTP server that can manage web requests. We can use it to cache static pages and manage a large influx of traffic if needed.

How does CodaLab use Docker?

Code submitted on the CodaLab platform is run in a Docker container. This environment can be reproduced identically on a local computer by downloading the corresponding image. CodaLab’s default environment contains a large number of pre-loaded programs, such as Python. It is possible to download or customize the default docker-codalab-legacy-worker image from the Docker Hub by searching for codalab/codalab-legacy.
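As a rough illustration of running a submission inside a container, the sketch below only builds a `docker run` command line; it does not claim to reproduce how CodaLab itself invokes Docker. The image name, the `/submission` mount point and the `run.py` entry point are hypothetical placeholders.

```python
def build_docker_command(image, submission_dir, command):
    """Build an argv list that runs `command` inside `image`.

    The submission directory is mounted read-only at /submission,
    a mount point chosen for this example only.
    """
    return [
        "docker", "run",
        "--rm",                                    # remove the container afterwards
        "-v", f"{submission_dir}:/submission:ro",  # mount the submitted code
        "-w", "/submission",                       # run from the mounted directory
        image,
        *command,
    ]


# Hypothetical values: replace with the image and paths of your installation.
cmd = build_docker_command(
    "example/worker-image", "/tmp/submission-42", ["python", "run.py"]
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Docker is installed. Using the same image for every participant is what guarantees an identical environment.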

CodaLab Installation

The installation is documented on the wiki, which walks step by step through setting up CodaLab on an Ubuntu machine. However, after several failed installations, we wrote our own installation guide for CentOS 7 summarizing the main actions to perform. First, download the source code hosted on GitHub, namely the “codalab-worksheets” and “codalab-cli” repositories.

In the following, the environment variable $HOME refers to the directory into which the “codalab-worksheets” and “codalab-cli” Git repositories are downloaded. The configuration files are stored in $CODALAB_HOME, which is ~/.codalab by default. Specific packages must be installed beforehand.

Package installation

Python and virtualenv dependencies

Node.js

MySQL

Docker

A “codalab” user must exist, because some commands must be executed as “codalab” and not as “root”.

Execute installation scripts

Once all the necessary prerequisites are installed, we can start the installation. Be careful: the installation scripts must be run as the “codalab” user.

Database configuration

Once the installation is complete, the database must be configured and secured. We create a “codalab” user and a database of the same name, then link them to CodaLab.

CodaLab must then be connected to the database.

Email service configuration

To offer a registration service, you must configure the email service. It is used to validate new registrations by email and lets users receive emails from the CodaLab server. The standard configuration is done through CodaLab by registering an email address (mail server host, email address, password). It is not possible to configure mail to be sent through a company-specific SMTP server. Several workarounds are available. For example, we can parse the logs and automatically send mail through an SMTP server whenever a new registration appears. We can also set up a watchdog that sends an email for each registration event. Either workaround, however, entails additional work.
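As an illustration of the log-parsing workaround mentioned above (not of the standard CodaLab email configuration), here is a minimal sketch that extracts new registrations from a log and prepares one message per address. The log line format, the sender address and the message text are all assumptions; the real CodaLab log format would have to be checked first.

```python
import re
from email.message import EmailMessage

# Hypothetical log format: adapt the pattern to the actual server logs.
REGISTRATION_PATTERN = re.compile(r"New user registered: (?P<email>\S+@\S+)")


def find_new_registrations(log_text):
    """Return the email addresses found in registration log lines."""
    return [m.group("email") for m in REGISTRATION_PATTERN.finditer(log_text)]


def build_confirmation(recipient, sender="codalab@example.com"):
    """Build a confirmation message (sender address is a placeholder)."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = "Confirm your CodaLab registration"
    message.set_content("Your registration was received.")
    return message


# Invented sample lines, for demonstration only.
sample_log = (
    "2018-12-01 10:00:00 INFO Server started\n"
    "2018-12-01 10:05:12 INFO New user registered: alice@example.com\n"
    "2018-12-01 10:07:40 INFO New user registered: bob@example.com\n"
)
messages = [build_confirmation(a) for a in find_new_registrations(sample_log)]
for m in messages:
    print(m["To"], "-", m["Subject"])
```

Each message could then be handed to the company SMTP server with `smtplib.SMTP(host).send_message(message)`, where `host` is the internal mail server.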

Installation and execution of Nginx

Nginx is an HTTP server that will manage all our web requests. First, we have to install it:

yum install -y nginx

Once installed, it must be configured to work with CodaLab. The configuration step generates an Nginx file in $HOME/codalab-worksheets/codalab/config/generated/nginx.

  • Insert include $HOME/codalab-worksheets/codalab/config/generated/nginx in the http block of /etc/nginx/nginx.conf.
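The manual edit above can also be scripted. The following is a minimal sketch that inserts an include directive right after the opening of the http block, working on the configuration text rather than on the file itself. The helper name and the sample configuration are hypothetical, and the path assumes that $HOME resolves to /home/codalab.

```python
import re


def add_include_to_http_block(conf_text, include_path):
    """Insert `include <path>;` as the first line of the http block."""
    directive = f"include {include_path};"
    if directive in conf_text:
        return conf_text  # already present: leave the file unchanged
    match = re.search(r"^[ \t]*http\s*\{[^\n]*\n", conf_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no http block found in the configuration")
    insert_at = match.end()
    return conf_text[:insert_at] + f"    {directive}\n" + conf_text[insert_at:]


# Tiny stand-in for /etc/nginx/nginx.conf, for demonstration only.
sample_conf = "events {\n}\nhttp {\n    sendfile on;\n}\n"
generated = "/home/codalab/codalab-worksheets/codalab/config/generated/nginx"
updated = add_include_to_http_block(sample_conf, generated)
print(updated)
```

Running the function twice gives the same result, so it is safe to call from a provisioning script. After writing the file back, `nginx -t` checks the configuration before reloading.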

Execution of the different services

Once all these steps are complete, we can launch the various services that CodaLab needs to run:

  • Start the website server

  • Start the API service

  • Start the bundle manager

  • Start the worker

Our CodaLab service is now configured and usable. It is available at http://localhost:8080 (or on whichever listening port Nginx is configured with).

Advantages

  • When organizing competitions internally, the evaluation scripts are run and the results are collected fully automatically.
  • Participants can easily test their output formats (for example, on test data) without assistance.
  • It is relatively easy to define the start and end dates of the different competitions.
  • CodaLab rankings may include several different scores and may be anonymous if desired.

Disadvantages

  • CodaLab, with the integration of our own agents, is not yet very stable, and we do not really have control over the installation: we launch setup scripts that take care of the whole installation.
  • The documentation is neither detailed nor explicit enough.
  • It is not possible to use a company SMTP server for sending emails. One workaround would be to use a watchdog or to parse the logs and send emails via our own SMTP server.
  • The Git project is not really up to date.

Summary

CodaLab Competition is a great solution for organizing competitions internally. However, it requires a functional CodaLab server, and installing the latter is not yet a smooth process. It does not always work well, and the project’s Git repository is not really up to date. We had to navigate all the branches to find the right information and the right scripts. In conclusion, after consultation with the customer’s teams, the decision was made to wait until the technology matures. A compatibility test with a container orchestration solution such as Kubernetes is on the roadmap, and it may give interesting results.

December 17th, 2018 | Categories: Big Data, Data Science

About the Author:

Robert Walid is a Big Data Consultant with 1 year of professional experience on Hadoop and Distributed Systems. He has designed, developed and operated data ingestion workflows and real-time services while accompanying his clients in defining their needs and implementing them. He is versatile on Big Data platforms, planning, design and architecture of cluster deployment, administration, maintenance and prototyping and industrialization of applications in collaboration with business users, analysts, Data Scientists, Engineers and Operations Teams. He assists Data Scientists in the monitoring and qualification of new components, the integration and provision of innovative platforms and the training of teams.