In complex production environments, servers usually have multiple network interfaces, whether for performance, network isolation, business requirements, or any other reason a large corporation's infrastructure team might come up with. Hadoop must therefore be able to handle these multiple networks and use the right one for each job.
Having multiple networks in production environments enables various governance capabilities. For instance, you might want to use a dedicated local network for your Hadoop cluster and a separate one to communicate with external components. Or, within a cluster, you might want your HDP components to talk to each other on their own network while clients access them through another one. Or all of the above in a disaster recovery architecture, where a dedicated high-bandwidth network carries data backup jobs between clusters, each of which has its own multi-network architecture separating intra-component communication from client access.
In this article, we will focus on how to configure HDFS to choose the right network for each task.
What we will go through
– the why: reasons multihomed environments make sense (spoiler: see context)
– the what: infrastructure components to be aware of (hint: NICs and DNS)
– the how: the configuration properties that handle exactly this
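As a preview of the configuration side, here is a minimal sketch of an `hdfs-site.xml` fragment using Hadoop's multihoming-related properties. The property names are real HDFS settings, but the values (wildcard bind addresses, the `eth1` interface name) are illustrative assumptions that depend on your own network layout:

```xml
<!-- Illustrative hdfs-site.xml fragment for a multihomed cluster.
     Property names are standard HDFS settings; values are examples only. -->

<!-- Bind the NameNode RPC and HTTP endpoints to all interfaces,
     so both networks can reach it. -->
<property>
  <name>dfs.namenode.rpc-bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>dfs.namenode.servicerpc-bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>dfs.namenode.http-bind-host</name>
  <value>0.0.0.0</value>
</property>

<!-- Have clients and DataNodes address DataNodes by hostname instead of IP,
     so DNS resolution can pick the network appropriate to each caller. -->
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
<property>
  <name>dfs.datanode.use.datanode.hostname</name>
  <value>true</value>
</property>

<!-- Assumed interface name: DataNodes resolve their own hostname
     from the address of this NIC. -->
<property>
  <name>dfs.datanode.dns.interface</name>
  <value>eth1</value>
</property>
```

Each of these properties is covered in more detail later; the point here is that multihoming support in HDFS is driven by how daemons bind their listeners and how hostnames resolve on each network, rather than by any single switch.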
Joris is a Big Data administrator with a Java developer background. He has worked for several organizations, each with its own unique environments and needs, on short expertise and support missions as well as long-term Big Data department building engagements, and as a Hortonworks software architect in operational and business-support positions.