Ambari - How to blueprint

By Joris RUMMENS

Jan 17, 2018

As infrastructure engineers at Adaltas, we deploy Hadoop clusters. A lot of them. Let’s see how to automate this process with REST requests.

While a web-based install wizard is really handy for deploying one or two clusters, filling in hundreds of fields with a lot of copy/pasting becomes painful when deploying a dozen of them. This is where automation comes in.

Our clients usually choose an enterprise-ready distribution like Hortonworks HDP or Cloudera CDH with their built-in cluster deployment and management solutions, namely Ambari and Cloudera Manager. These tools offer an easy way to deploy clusters through their well-documented and straightforward UIs. In this article, we will focus on HDP’s deployment tool, Ambari, and its cluster definition files: blueprints.

What are blueprints

Blueprints in an Ambari environment can mean two things. The first one is the following, taken directly from Ambari’s documentation:

Ambari Blueprints are a declarative definition of a cluster. With a Blueprint, you specify a Stack, the Component layout and the Configurations to materialize a Hadoop cluster instance (via a REST API) without having to use the Ambari Cluster Install Wizard.

This is the global definition of the Ambari Blueprint technology. This technology is, in fact, two JSON files submitted one after the other to Ambari’s REST API.

One of these files, the first to be submitted, is the second meaning of a blueprint. It represents a template that can be used for as many cluster deployments as we like. Since it can be used to define multiple clusters over various environments, it has to be as generic as possible.

The second file to be submitted sets all the properties that are specific to one cluster instance. We’ll call it the cluster file. Ambari takes the information from the previously submitted blueprint file and enriches it with the cluster file to launch the deployment process. When a property is set in both files, the value from the cluster file overrides the one from the blueprint file.
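The override rule can be pictured with a short Python sketch. It is only a simplified model of a property-level overlay, not Ambari's actual merge logic (which also involves stack defaults and the stack advisor). The function names `to_mapping` and `effective_configuration` and the `prodnamespace` value are made up for the illustration.

```python
def to_mapping(configurations):
    """Flatten the list-of-single-key-objects form into {category: {property: value}}."""
    merged = {}
    for item in configurations:
        for category, body in item.items():
            merged.setdefault(category, {}).update(body.get("properties", {}))
    return merged


def effective_configuration(blueprint_configs, cluster_configs):
    """Overlay cluster file properties on top of blueprint file properties."""
    result = to_mapping(blueprint_configs)
    for category, props in to_mapping(cluster_configs).items():
        # A property present in both files takes the cluster file's value.
        result.setdefault(category, {}).update(props)
    return result


# Hypothetical inputs: the blueprint is generic, the cluster file is specific.
blueprint_configs = [
    {"core-site": {"properties": {
        "fs.defaultFS": "hdfs://mynamespace",
        "ipc.maximum.data.length": "134217728",
    }}}
]
cluster_configs = [
    {"core-site": {"properties": {"fs.defaultFS": "hdfs://prodnamespace"}}}
]

print(effective_configuration(blueprint_configs, cluster_configs))
```

In this model, `fs.defaultFS` takes the cluster file's value while `ipc.maximum.data.length` is inherited from the blueprint.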

This is what a cluster deployment using Ambari’s Blueprints looks like:

  1. Install and configure Ambari to be ready to receive a cluster deployment request
  2. Create and submit the blueprint.json file via the REST API to Ambari
  3. Create and submit the cluster.json file via the REST API to Ambari
  4. Wait for the deployment process to end
  5. Tune the configurations set by Ambari’s stack advisor

File structure - blueprint.json

The blueprint.json file has three categories at its root:

  • Blueprints, where the blueprint’s global information is set. This includes the stack name and version, and security type.
  • host_groups, which defines host profiles and the components that are deployed on each of them.
  • configurations, with most of the non-default configurations of these components.

At this point, your JSON file should look like this:

{
  "Blueprints": {},
  "host_groups": [],
  "configurations": []
}

Category content - blueprints

Ambari supports multiple stacks to deploy. The most used is Hortonworks’ HDP, which is what we’ll use in our example. As for security, choose between NONE and KERBEROS. You might want to add a custom kerberos_descriptor, but we did not need one in our case, so we won’t cover it here.

Here’s a simple, functional sample of the Blueprints category for a kerberized HDP 2.6 cluster:

"Blueprints":{
  "stack_name":"HDP",
  "stack_version":"2.6",
  "security":{
    "type":"KERBEROS"
  }
}
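If you generate the file rather than write it by hand, the skeleton and the Blueprints category can be produced by a small Python helper. This is a minimal sketch: the function name `build_blueprint` and its parameters are our own choices for the example, only the JSON keys come from the structure described above.

```python
import json


def build_blueprint(stack_name, stack_version, kerberized,
                    host_groups=None, configurations=None):
    """Return a dict with the three root categories of blueprint.json."""
    return {
        "Blueprints": {
            "stack_name": stack_name,
            "stack_version": stack_version,
            "security": {"type": "KERBEROS" if kerberized else "NONE"},
        },
        # Both lists are filled in by the following sections.
        "host_groups": host_groups or [],
        "configurations": configurations or [],
    }


blueprint = build_blueprint("HDP", "2.6", kerberized=True)
print(json.dumps(blueprint, indent=2))
```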

Category content - host groups

Host groups define templates to apply to groups of hosts in your cluster.

This is the information you can set in a template:

  • The components that will be deployed on each host mapped to this profile
  • The number of hosts expected to match this profile
  • Some custom configurations to be applied to only this type of hosts
  • A name that best represents hosts of this profile

Some examples of host groups you might want to define in this section: management nodes, worker nodes, master nodes, edge nodes…

Note that you’ll probably have to define multiple master node profiles as they usually do not share the same components.

For HDP 2.6, these are the available components:

hdfs:
- NAMENODE
- ZKFC
- JOURNALNODE
- DATANODE
- HDFS_CLIENT
zookeeper:
- ZOOKEEPER_SERVER
- ZOOKEEPER_CLIENT
tez:
- TEZ_CLIENT
yarn:
- RESOURCEMANAGER
- APP_TIMELINE_SERVER
- NODEMANAGER
- YARN_CLIENT
mapreduce:
- HISTORYSERVER
- MAPREDUCE2_CLIENT
slider:
- SLIDER
ranger:
- RANGER_ADMIN
- RANGER_USERSYNC
- RANGER_TAGSYNC
logsearch:
- LOGSEARCH_SERVER
- LOGSEARCH_LOGFEEDER
ambari_infra:
- INFRA_SOLR
- INFRA_SOLR_CLIENT
ambari_metrics:
- METRICS_COLLECTOR
- METRICS_GRAFANA
- METRICS_MONITOR
hbase:
- HBASE_MASTER
- HBASE_REGIONSERVER
- HBASE_CLIENT
- PHOENIX_QUERY_SERVER
atlas:
- ATLAS_SERVER
- ATLAS_CLIENT
oozie:
- OOZIE_SERVER
- OOZIE_CLIENT
kafka:
- KAFKA_BROKER
storm:
- STORM_UI_SERVER
- NIMBUS
- DRPC_SERVER
- SUPERVISOR
sqoop:
- SQOOP
zeppelin:
- ZEPPELIN_MASTER
hive:
- HCAT
- HIVE_SERVER
- HIVE_SERVER_INTERACTIVE
- HIVE_METASTORE
- WEBHCAT_SERVER
- HIVE_CLIENT
pig:
- PIG
spark:
- SPARK_THRIFTSERVER
- SPARK_CLIENT
- SPARK_JOBHISTORYSERVER
spark2:
- SPARK2_THRIFTSERVER
- SPARK2_CLIENT
- SPARK2_JOBHISTORYSERVER
kerberos:
- KERBEROS_CLIENT
knox:
- KNOX_GATEWAY

Here is a host group sample for worker nodes:

{
  "name":"worker",
  "cardinality":"6",
  "components":[
    { "name":"DATANODE" },
    { "name":"NODEMANAGER" },
    { "name":"LOGSEARCH_LOGFEEDER" },
    { "name":"METRICS_MONITOR" },
    { "name":"HBASE_REGIONSERVER" },
    { "name":"SUPERVISOR" },
    { "name":"KERBEROS_CLIENT" }
  ],
  "configurations":[]
}
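A host group like the one above is easy to generate from a plain list of component names. The sketch below is one possible helper, with a hypothetical name (`host_group`) and a duplicate check that is our own safeguard rather than anything Ambari requires.

```python
def host_group(name, cardinality, components, configurations=None):
    """Build one entry of the blueprint's host_groups list."""
    if not components:
        raise ValueError(f"host group {name!r} has no components")
    duplicates = sorted({c for c in components if components.count(c) > 1})
    if duplicates:
        raise ValueError(f"host group {name!r} lists {duplicates} more than once")
    return {
        "name": name,
        # The sample above writes the cardinality as a string, so we do the same.
        "cardinality": str(cardinality),
        "components": [{"name": component} for component in components],
        "configurations": configurations or [],
    }


worker = host_group("worker", 6, [
    "DATANODE", "NODEMANAGER", "LOGSEARCH_LOGFEEDER", "METRICS_MONITOR",
    "HBASE_REGIONSERVER", "SUPERVISOR", "KERBEROS_CLIENT",
])
```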

Category content - configurations

This is where you will put most of your custom configurations.

There is no need to set every configuration property for every component you plan to deploy. Most of them have default values defined by their component, and Ambari comes with a stack advisor that automatically sets some others based on your infrastructure. Add only the ones that you alone are able to define, which is already plenty.

The structure of a configuration item is the following:

"configurations":[
  {
    "configuration_category":{
      "properties":{
        "property": "value",
        "property": "value"
      }
    }
  },
  {
    "configuration_category":{
      ...
    }
  },
  ...
]

A configuration category is a set of properties that can usually be found in a single configuration file. Some common examples are: core-site, hdfs-site, hadoop-env, zookeeper-env, …

To get an exhaustive list of the configuration categories supported by Ambari, you can either export a blueprint from an existing cluster with the same components deployed on it or look at the configuration sections in the UI. Be aware that Ambari may divide a category into several sections. For example, the “core-site” category can be found as “Advanced core-site” and “Custom core-site” in the UI, but is defined as simply “core-site” in a blueprint file.

Also, a good practice is to let Ambari handle the resource sizing of your components first, and then tune the values through the UI.

There is one configuration category, though, that does not belong to any of the components you want to deploy: cluster-env. This is a special category for Ambari’s own properties, which it uses to know how it should deploy your cluster. If you have ever deployed a cluster through the UI, you will notice that its properties are the ones found in the “Misc” tab.

So, here’s a part of what the configurations category could contain:

"configurations":[
  {
    "cluster-env":{
      "properties":{
        "ignore_groupsusers_create":"true"
      }
    }
  },
  {
    "core-site":{
      "properties":{
        "fs.defaultFS":"hdfs://mynamespace",
        "ha.zookeeper.quorum":"%HOSTGROUP::zk_node%:2181",
        "ipc.maximum.data.length":"134217728"
      }
    }
  },
  {
    "hadoop-env":{
      "properties":{
        "hdfs_log_dir_prefix":"/path/to/logs/hadoop"
      }
    }
  }
]

In the previous example, you can see a value called %HOSTGROUP::zk_node%. This is a variable that Ambari replaces with the hostnames of all hosts mapped to the host group zk_node. Be cautious when using it, though, as the conversion is not yet supported on all properties.

Properties that are known to handle the %HOSTGROUP::hg_name% conversion:

core-site:
- ha.zookeeper.quorum
hdfs-site:
- dfs.namenode.http-address
- dfs.namenode.http-address.mynamespace.nn1
- dfs.namenode.http-address.mynamespace.nn2
- dfs.namenode.https-address
- dfs.namenode.https-address.mynamespace.nn1
- dfs.namenode.https-address.mynamespace.nn2
- dfs.namenode.rpc-address.mynamespace.nn1
- dfs.namenode.rpc-address.mynamespace.nn2
- dfs.namenode.shared.edits.dir
yarn-site:
- hadoop.registry.zk.quorum
- yarn.log.server.url
- yarn.resourcemanager.address
- yarn.resourcemanager.admin.address
- yarn.resourcemanager.hostname
- yarn.resourcemanager.resource-tracker.address
- yarn.resourcemanager.scheduler.address
- yarn.resourcemanager.webapp.address
- yarn.resourcemanager.webapp.https.address
- yarn.resourcemanager.zk-address
- yarn.resourcemanager.hostname.rm1
- yarn.resourcemanager.hostname.rm2
- yarn.timeline-service.address
- yarn.timeline-service.webapp.address
- yarn.timeline-service.webapp.https.address
admin-properties:
- policymgr_external_url
ranger-kafka-plugin-properties:
- zookeeper.connect
hbase-site:
- hbase.zookeeper.quorum
application-properties:
- atlas.audit.hbase.zookeeper.quorum
- atlas.graph.index.search.solr.zookeeper-url
- atlas.graph.storage.hostname
- atlas.kafka.bootstrap.servers
- atlas.kafka.zookeeper.connect
oozie-site:
- oozie.base.url
- oozie.zookeeper.connection.string
oozie-env:
- oozie_hostname
kafka-broker:
- zookeeper.connect
storm-site:
- storm.zookeeper.servers
hive-site:
- hive.zookeeper.quorum
- hive.metastore.uris
hive-interactive-site:
- hive.llap.zk.sm.connectionString

When a property requires actual hostnames and does not support the conversion, define it in the cluster.json file instead (see section “File structure - cluster.json” below).
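A typo in a %HOSTGROUP::hg_name% variable is easy to miss in a large file. As a pre-submission sanity check, the Python sketch below lists every variable that points to a host group the blueprint does not define. The function name `undefined_hostgroup_refs` is hypothetical, and the check is ours: it says nothing about whether Ambari supports the conversion for a given property.

```python
import re

# Matches %HOSTGROUP::name% and captures the host group name.
HOSTGROUP_REF = re.compile(r"%HOSTGROUP::([^%]+)%")


def undefined_hostgroup_refs(blueprint):
    """Return (category, property, group) for each reference to an unknown host group."""
    defined = {group["name"] for group in blueprint.get("host_groups", [])}
    missing = []
    for item in blueprint.get("configurations", []):
        for category, body in item.items():
            for prop, value in body.get("properties", {}).items():
                for ref in HOSTGROUP_REF.findall(str(value)):
                    if ref not in defined:
                        missing.append((category, prop, ref))
    return missing


# Hypothetical blueprint: "zk_node" is referenced but only "worker" is defined.
sample = {
    "host_groups": [{"name": "worker"}],
    "configurations": [
        {"core-site": {"properties": {
            "ha.zookeeper.quorum": "%HOSTGROUP::zk_node%:2181",
        }}}
    ],
}
print(undefined_hostgroup_refs(sample))
```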

File structure - cluster.json

While the blueprint.json file represents the template of your cluster deployment, the cluster.json file is the instantiation of your deployment. This means that it is specific to one cluster and contains hard-coded values.

The cluster.json file has five categories at its root:

  • blueprint, the name of the blueprint (template) that you previously created. Its name is defined at its submission.
  • host_groups, which are the mapping between the hostnames of your infrastructure and their profile defined in the blueprint.
  • configurations, with the properties that are specific to this cluster deployment.
  • security, whose type must match the security type set in the “Blueprints” section of the blueprint.
  • credentials, KDC connection information for a kerberized cluster

At this point, your JSON file should look like this:

{
  "blueprint":"blueprint_name",
  "host_groups":[],
  "configurations":[],
  "security":{
    "type":"NONE|KERBEROS"
  },
  "credentials":[]
}

Category content - host groups

Unlike in the blueprint, host groups in the cluster.json file are used to map real hosts to a previously defined template.

Its structure is fairly straightforward, as you just set the name of the template (aka. group) and assign a list of hosts to it:

{
  "name":"group_name",
  "hosts":[
    { "fqdn":"my_hostname" },
    { "fqdn":"my_hostname" },
    ...
  ]
}

To take the worker template as an example again, here’s what it would look like:

"host_groups":[
  {
    "name":"worker",
    "hosts":[
      { "fqdn":"worker1_hostname" },
      { "fqdn":"worker2_hostname" },
      { "fqdn":"worker3_hostname" },
      { "fqdn":"worker4_hostname" },
      { "fqdn":"worker5_hostname" },
      { "fqdn":"worker6_hostname" }
    ]
  },
  {
    "name":"some_other_group",
    "hosts":[
      ...
    ]
  },
  ...
]
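When the host list comes from an inventory, this mapping can be generated and checked against the blueprint at the same time. The sketch below is an assumption-laden example: `map_hosts` is a name we made up, and comparing the number of hosts with the blueprint's cardinality is our own safeguard, applied only when the cardinality is a plain number.

```python
def map_hosts(blueprint_host_groups, hosts_by_group):
    """Build the cluster.json host_groups list from {group name: [fqdn, ...]}."""
    expected = {g["name"]: g.get("cardinality") for g in blueprint_host_groups}
    unknown = sorted(set(hosts_by_group) - set(expected))
    if unknown:
        raise ValueError(f"groups not defined in the blueprint: {unknown}")
    result = []
    for name, cardinality in expected.items():
        hosts = hosts_by_group.get(name, [])
        # Only plain integer cardinalities are compared; other forms are not checked.
        if cardinality is not None and str(cardinality).isdigit():
            if len(hosts) != int(cardinality):
                raise ValueError(
                    f"group {name!r} expects {cardinality} hosts, got {len(hosts)}"
                )
        result.append({"name": name, "hosts": [{"fqdn": host} for host in hosts]})
    return result


# Hypothetical data: a blueprint with two workers and their hostnames.
groups = [{"name": "worker", "cardinality": "2"}]
cluster_host_groups = map_hosts(
    groups, {"worker": ["worker1_hostname", "worker2_hostname"]}
)
```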

Category content - configurations

Most of your configuration properties should have been defined in the blueprint.json file as they can be used in various cluster implementations.

However, there are two types of properties that are limited to a specific deployment:

  • user and infrastructure dependent configurations
  • configurations that do not handle %HOSTGROUP::hg_name% conversion

You might also want to add properties that rely on the previously mentioned ones.

In the first category, you’ll mostly find database connection information, authentication credentials, and business-related properties like YARN queues.

This is a sample of these configurations:

capacity-scheduler:
- yarn.scheduler.capacity.root.queues
- yarn.scheduler.capacity.root.myqueue.capacity
- yarn.scheduler.capacity.root.myqueue.maximum-capacity
- yarn.scheduler.capacity.root.myqueue.acl_administer_queue
- yarn.scheduler.capacity.root.myqueue.acl_submit_applications
- yarn.scheduler.capacity.root.myqueue.queues
ranger-admin-site:
- ranger.jpa.jdbc.driver
- ranger.jpa.jdbc.url
admin-properties:
- db_host
- db_name
- db_user
- db_root_user
- db_password
- db_root_password
ranger-env:
- ranger_admin_username
- ranger_admin_password
ranger-hdfs-audit:
- xasecure.audit.destination.hdfs.dir # If you enabled HA and thus have a namespace name
ranger-yarn-audit:
- xasecure.audit.destination.hdfs.dir # If you enabled HA and thus have a namespace name
ranger-atlas-audit:
- xasecure.audit.destination.hdfs.dir # If you enabled HA and thus have a namespace name
ranger-kafka-audit:
- xasecure.audit.destination.hdfs.dir # If you enabled HA and thus have a namespace name
ranger-storm-audit:
- xasecure.audit.destination.hdfs.dir # If you enabled HA and thus have a namespace name
ranger-hive-audit:
- xasecure.audit.destination.hdfs.dir # If you enabled HA and thus have a namespace name
ranger-knox-audit:
- xasecure.audit.destination.hdfs.dir # If you enabled HA and thus have a namespace name
logsearch-admin-json:
- logsearch_admin_password
logsearch-env:
- logsearch_truststore_password
- logsearch_keystore_password
logfeeder-env:
- logfeeder_truststore_password
- logfeeder_keystore_password
ams-grafana-env:
- metrics_grafana_username
- metrics_grafana_password
atlas-env:
- atlas.admin.username
- atlas.admin.password
oozie-site:
- oozie.db.schema.name
- oozie.service.JPAService.jdbc.password
- oozie.service.JPAService.jdbc.url
- oozie.service.JPAService.jdbc.username
hive-site:
- javax.jdo.option.ConnectionURL
- ambari.hive.db.schema.name
- javax.jdo.option.ConnectionUserName
- javax.jdo.option.ConnectionPassword
hive-env:
- hive_database_name
knox-env:
- knox_master_secret

The following properties are known to not handle the %HOSTGROUP::hg_name% conversion:

ranger-admin-site:
- ranger.audit.solr.zookeepers
- ranger.sso.providerurl
ranger-hdfs-security:
- ranger.plugin.hdfs.policy.rest.url
ranger-yarn-security:
- ranger.plugin.yarn.policy.rest.url
ranger-hbase-security:
- ranger.plugin.hbase.policy.rest.url
ranger-atlas-security:
- ranger.plugin.atlas.policy.rest.url
ranger-kafka-security:
- ranger.plugin.kafka.policy.rest.url
ranger-storm-security:
- ranger.plugin.storm.policy.rest.url
ranger-hive-security:
- ranger.plugin.hive.policy.rest.url
ranger-knox-security:
- ranger.plugin.knox.policy.rest.url
ranger-hdfs-audit:
- xasecure.audit.destination.solr.zookeepers
ranger-yarn-audit:
- xasecure.audit.destination.solr.zookeepers
ranger-hbase-audit:
- xasecure.audit.destination.solr.zookeepers
ranger-atlas-audit:
- xasecure.audit.destination.solr.zookeepers
ranger-kafka-audit:
- xasecure.audit.destination.solr.zookeepers
ranger-storm-audit:
- xasecure.audit.destination.solr.zookeepers
ranger-hive-audit:
- xasecure.audit.destination.solr.zookeepers
ranger-knox-audit:
- xasecure.audit.destination.solr.zookeepers
application-properties:
- atlas.rest.address

The configuration category keeps the same structure as in the blueprint.json file. Here’s a sample:

"configurations":[
  {
   "ranger-admin-site":{
      "properties":{
         "ranger.jpa.jdbc.driver":"org.postgresql.Driver",
         "ranger.jpa.jdbc.url":"jdbc:postgresql://dbhostname:5432/rangerdb",
         "ranger.audit.solr.zookeepers":"zknode1:2181,zknode2:2181,zknode3:2181/infra-solr",
         "ranger.sso.providerurl":"https://gateway_hostname:8443/gateway/knoxsso/api/v1/websso"
      }
   }
  },
  {
     "ranger-env":{
        "properties":{
           "ranger_admin_username":"dbuser",
           "ranger_admin_password":"dbpassword"
        }
     }
  },
  ...
]

Category content - credentials

For the same reasons as the properties set in the configurations section, KDC credentials of a secure cluster have to be defined on a deployment basis.

This is the structure of it:

"credentials":[
   {
      "alias":"kdc.admin.credential",
      "principal":"myadminprincipal@REALM",
      "key":"principal_password",
      "type":"TEMPORARY"
   }
]
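Putting the five categories together, a complete cluster.json can be assembled as in the following Python sketch. The function name `build_cluster_template` and its parameters are hypothetical. Deriving the security type from the presence of a KDC principal is a shortcut chosen for this example, so make sure it matches what your blueprint declares.

```python
import json


def build_cluster_template(blueprint_name, host_groups, configurations=None,
                           kdc_principal=None, kdc_password=None):
    """Assemble the five root categories of cluster.json."""
    kerberized = kdc_principal is not None
    credentials = []
    if kerberized:
        credentials.append({
            "alias": "kdc.admin.credential",
            "principal": kdc_principal,
            "key": kdc_password,
            "type": "TEMPORARY",
        })
    return {
        "blueprint": blueprint_name,
        "host_groups": host_groups,
        "configurations": configurations or [],
        "security": {"type": "KERBEROS" if kerberized else "NONE"},
        "credentials": credentials,
    }


cluster = build_cluster_template(
    "blueprint_name",
    [{"name": "worker", "hosts": [{"fqdn": "worker1_hostname"}]}],
    kdc_principal="myadminprincipal@REALM",
    kdc_password="principal_password",
)
print(json.dumps(cluster, indent=2))
```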

Ambari REST API usage

Be sure to have a running Ambari server and agents to send the blueprint to. The remote Ambari repository also has to be reachable for components like Ambari-Infra or Ambari-Metrics.

Repositories registration

First, register your HDP repositories to use for this deployment. This can be done using the following request:

# The '"${var}"' sequence closes the single-quoted JSON so the shell can expand the variable.
curl -H "X-Requested-By: ambari" -u "$user:$password" -X PUT -d '{
  "Repositories" : {
    "base_url" : "'"${HDP_base_url}"'",
    "verify_base_url" : true
  }
}' "http://${ambari_host}:8080/api/v1/stacks/HDP/versions/${HDP_version}/operating_systems/${OS}/repositories/HDP-${HDP_version}"

curl -H "X-Requested-By: ambari" -u "$user:$password" -X PUT -d '{
  "Repositories" : {
    "base_url" : "'"${HDP_UTILS_base_url}"'",
    "verify_base_url" : true
  }
}' "http://${ambari_host}:8080/api/v1/stacks/HDP/versions/${HDP_version}/operating_systems/${OS}/repositories/HDP-UTILS-${HDP_UTILS_version}"

To register HDP 2.6 for RedHat 7 from Hortonworks’ public repositories, use the following:

curl -H "X-Requested-By: ambari" -u admin:admin -X PUT -d '{
  "Repositories" : {
    "base_url" : "http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.6.1.0",
    "verify_base_url" : true
  }
}' http://${ambari_host}:8080/api/v1/stacks/HDP/versions/2.6/operating_systems/redhat7/repositories/HDP-2.6

curl -H "X-Requested-By: ambari" -u admin:admin -X PUT -d '{
  "Repositories" : {
    "base_url" : "http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.21/repos/centos7",
    "verify_base_url" : true
  }
}' http://${ambari_host}:8080/api/v1/stacks/HDP/versions/2.6/operating_systems/redhat7/repositories/HDP-UTILS-1.1.0.21

Blueprint files submission

As seen above, start by submitting the blueprint template file. For the ${blueprint_name}, use the same value as the “blueprint” property of your cluster.json file.

curl -H "X-Requested-By: ambari" -X POST -u $user:$password -d '@/path/to/blueprint.json' http://${ambari_host}:8080/api/v1/blueprints/${blueprint_name}

Finally, submit the definition of the current cluster. It will take ${cluster_name} as the name of the cluster.

curl -H "X-Requested-By: ambari" -X POST -u $user:$password -d '@/path/to/cluster.json' http://${ambari_host}:8080/api/v1/clusters/${cluster_name}
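The same two calls can be scripted. The Python sketch below only builds the requests, in the order they must be sent, using the standard library; it also checks that the blueprint name in the URL matches the “blueprint” property of the cluster file. The function names and the `ambari.example.com` host are placeholders of ours, and error handling is left out.

```python
import base64
import json
import urllib.request


def ambari_post(base_url, path, payload, user, password):
    """Build (but do not send) a POST request equivalent to the curl calls above."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/{path}",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"X-Requested-By": "ambari", "Authorization": f"Basic {token}"},
    )


def deployment_requests(base_url, user, password,
                        blueprint_name, blueprint, cluster_name, cluster):
    """Return the blueprint request followed by the cluster request."""
    if cluster.get("blueprint") != blueprint_name:
        raise ValueError("cluster.json must reference the submitted blueprint name")
    return [
        ambari_post(base_url, f"blueprints/{blueprint_name}", blueprint, user, password),
        ambari_post(base_url, f"clusters/{cluster_name}", cluster, user, password),
    ]


# To actually send them against a running Ambari server:
# for request in deployment_requests(...):
#     with urllib.request.urlopen(request) as response:
#         print(response.status)
```

Keeping request construction separate from sending makes the script easy to dry-run before pointing it at a real Ambari server.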

Conclusion

Even with blueprints, there are plenty of configuration parameters to set. In fact, it may even take longer to create a single blueprint than to fill in all the fields of each service in Ambari’s web wizard. Blueprints pay off when you deploy multiple clusters, or create and destroy environments automatically.

To do this, more than just blueprints is required. For example, for one of our customers, we use Puppet to automate the preparation of the hosts and the installation of Ambari’s server and agents. When done, it runs a custom-built Ruby script to generate the blueprint.json and cluster.json files and submit them to the newly installed Ambari. The same can be done through Ansible, or even a custom orchestration engine like the one we wrote, Nikita.

In conclusion, Ambari’s blueprints enable the automation of an HDP (or other distribution) deployment, but can hardly do it alone. Choose the tools that fit you best or that are already used by your company, and create a JSON builder for the blueprint.json and cluster.json files.
