How to create a libcloud driver from scratch: the NephoScale case

The use of Web APIs for communicating with cloud providers has been increasing over the last few years and every computing and web services provider is expected to maintain a well documented API for their clients. It enables developers, system administrators and engineers to automate tasks and assists in the deployment of servers and services. As multi-cloud setups become the norm and cloud management tools a necessity, a well documented API significantly enhances the user experience.

There are many tools in different programming languages that facilitate the communication with each cloud API. In Python, Apache Libcloud is the de facto library to communicate with one or more cloud providers. It abstracts away differences among multiple cloud provider APIs and provides a unified way to manage cloud resources.

Mist.io makes heavy use of Libcloud in order to support a wide range of providers in a unified interface.

Introduction to Libcloud

Apache Libcloud logo

Libcloud was initially developed by the awesome Cloudkick team. Today, it is an Apache Software Foundation project. It has a healthy community, with lots of contributors, updated documentation, mailing lists, a ticketing system, a continuous integration platform and an established user base.

Libcloud allows users to manage four different kinds of cloud resources: cloud servers, cloud storage, load balancers-as-a-service and DNS-as-a-service. Cloud servers is the oldest and most mature part of the library and currently supports more than 26 providers.

Writing a Libcloud Driver

Many cloud providers, e.g. Amazon, Rackspace, Linode and DigitalOcean, are already supported by Libcloud. In this post we want to illustrate how easy it is to add support for new providers, by documenting the steps we followed to add support for NephoScale, a cutting-edge US-based cloud provider.

Nephoscale logo

What should we implement

Our goal is to create a compute driver that will provide the most common server management actions, i.e. actions that Libcloud supports universally across cloud providers:

  • list_nodes: provide a list of our nodes, including information such as public/private IP addresses, the node state (running, rebooting, stopped etc.) and some extra metadata such as the image and location
  • list_images: list all the images the provider offers for node creation
  • list_sizes: list the different size plans, providing data such as CPU, RAM and disk size. Some providers also include the price of each plan
  • list_locations: list the different zones/datacenters where nodes can be deployed

and the following node actions:

  • create_node: create a node, specifying its size, image, name and location
  • deploy_node: create a node and deploy a public SSH key for server access, optionally running a deploy script once the node is initialized and accessible
  • reboot_node/shutdown_node
  • start_node/stop_node: start a stopped node, or stop a started one
  • destroy_node: delete the node

We will implement the above for the NephoScale driver, plus some extra functions that will allow us to create, list and delete keypairs. Keypairs in NephoScale can be SSH or password keys, for server or console access. DigitalOcean also allows listing/creating/deleting keys through its API, while other cloud providers do not offer that functionality. Another example of provider-specific functionality is tagging, which only a few cloud providers (e.g. AWS, Rackspace) support. In this example, in order to be as generic as possible, we will target functionality that is available for most, if not all, cloud providers.
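By Libcloud convention, provider-specific methods like these carry an ex_ prefix, marking them as extensions beyond the standard API. A minimal sketch of the shape such methods could take follows; the method names, arguments and behaviour here are illustrative assumptions, not the final driver API:

```python
class KeypairExtensionSketch(object):
    """Illustrative only: shows the ex_ naming convention that Libcloud
    uses for provider-specific driver methods. Argument names and the
    implied endpoints are assumptions."""

    def ex_list_keypairs(self, ssh=False, password=False):
        # would GET the provider's key listing endpoint and filter
        # by key type (SSH key vs console password)
        raise NotImplementedError

    def ex_delete_keypair(self, key_id):
        # would send a DELETE request for the given key id
        raise NotImplementedError
```

Standard callers that only use the common Libcloud API can safely ignore anything prefixed with ex_, which is exactly the point of the convention.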

Getting started with the compute driver

We will need access to NephoScale's API. We will use the DigitalOcean driver as a template, as it is one of the most recent, well written and up-to-date drivers. There is an ongoing effort in Libcloud to standardize the basic functionality across all compute drivers and to update the older ones. It also helps a lot when cloud providers contribute code or review the drivers created for them, or, even better, maintain them as their API changes.

Compute drivers are located in libcloud/compute/drivers and they all inherit from the base compute driver NodeDriver in libcloud/compute/base.py. The driver is well documented and we can learn a lot from reading the comments and of course the code.

We will start by adding our provider to the DRIVERS dict in libcloud/compute/providers.py and to the Provider class in libcloud/compute/types.py. This is pretty straightforward, as we can see from the existing drivers.
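The two registrations boil down to something like the following sketch. It is illustrative only: the real files list many more providers, and the exact layout and constant values vary between Libcloud versions.

```python
# In libcloud/compute/types.py: add a constant to the Provider class.
class Provider(object):
    # ... existing providers ...
    NEPHOSCALE = 'nephoscale'

# In libcloud/compute/providers.py: map the constant to the module path
# and driver class name, so get_driver() can import the driver lazily.
DRIVERS = {
    # ... existing drivers ...
    Provider.NEPHOSCALE:
        ('libcloud.compute.drivers.nephoscale', 'NephoscaleNodeDriver'),
}
```

With this mapping in place, get_driver(Provider.NEPHOSCALE) can locate and load our driver class without importing every driver module up front.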

In libcloud/compute/drivers/ we create a file with the name of our provider. In our case, nephoscale.py.

After the initial imports, which we copy from the DigitalOcean driver, we specify the API endpoint of our cloud provider:

API_HOST = 'api.nephoscale.com'

and a dict with the states that the nodes can have. This differs from provider to provider, with some providers having just a few states (running, stopped) and others having intermediate ones (e.g. rebooting, shutting down):

NODE_STATE_MAP = {
    'on': NodeState.RUNNING,
    'off': NodeState.UNKNOWN,
    'unknown': NodeState.UNKNOWN,
}

Writing the NephoscaleNodeDriver

We then need to create a NodeDriver-based driver:

class NephoscaleNodeDriver(NodeDriver):
    """Nephoscale node driver class"""

    type = Provider.NEPHOSCALE
    api_name = 'nephoscale'
    name = 'NephoScale'
    website = 'http://www.nephoscale.com'
    connectionCls = NephoscaleConnection
    features = {'create_node': ['ssh_key']}

    def list_locations(self):
        ...

Here we specify the type (the same as in libcloud/compute/types.py), the connection class (NephoscaleConnection) and a features dict declaring how deploy_node will try to authenticate to the created node after create_node has run. All the functions we want to implement (list_images, list_nodes, reboot_node etc.) will live inside NephoscaleNodeDriver. For consistency, we make sure that list_nodes returns a list of Node objects, list_images a list of NodeImage objects, list_sizes a list of NodeSize objects and list_locations a list of NodeLocation objects.
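To illustrate that consistency requirement, a list_sizes implementation essentially maps each entry of the provider's JSON reply onto a NodeSize. The stand-in class and the response field names below are illustrative assumptions; the real NodeSize lives in libcloud/compute/base.py:

```python
from collections import namedtuple

# Minimal stand-in for libcloud.compute.base.NodeSize, just to show the
# shape of the objects list_sizes is expected to return.
NodeSize = namedtuple('NodeSize', 'id name ram disk bandwidth price driver')

def list_sizes(api_response, driver=None):
    """Turn a provider's raw size listing into NodeSize objects.

    The 'data', 'ram' and 'storage' keys are hypothetical field names,
    standing in for whatever the provider's API actually returns.
    """
    sizes = []
    for item in api_response['data']:
        sizes.append(NodeSize(
            id=item['id'],
            name=item['name'],
            ram=item['ram'],       # MB
            disk=item['storage'],  # GB
            bandwidth=None,        # not reported by this hypothetical API
            price=None,            # filled in from pricing data, if any
            driver=driver,
        ))
    return sizes
```

Whatever the provider calls its fields, the caller always gets back the same NodeSize attributes, which is what lets multi-cloud code treat providers interchangeably.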

Writing the Connection class

NephoscaleConnection is the Connection class that handles the connection and authentication, and sends requests to the NephoScale API endpoint. All drivers implement a Connection class that is responsible for sending the request data, adding HTTP headers or params, and encoding the request body. The API host endpoint is specified there, along with the HTTP headers or params each provider uses for authentication.

NephoScale API calls require HTTP Basic Authentication, with the user/password base64-encoded on every request, so we add the Authorization header in add_default_headers:

class NephoscaleConnection(ConnectionUserAndKey):
    """Nephoscale connection class

    Authenticates to the API through Basic Authentication
    with username/password
    """
    host = API_HOST
    responseCls = NephoscaleResponse

    def add_default_headers(self, headers):
        user_b64 = base64.b64encode(b('%s:%s' % (self.user_id, self.key)))
        headers['Authorization'] = 'Basic %s' % (user_b64.decode('utf-8'))
        return headers

Other providers handle authentication by requiring params on each request (API key, password, secret etc).

DigitalOcean, for example, requires the client ID and API key on each request, and we pass them with add_default_params:

class DigitalOceanConnection(ConnectionUserAndKey):

    ...

    def add_default_params(self, params):
        params['client_id'] = self.user_id
        params['api_key'] = self.key
        return params

Inside our Connection class we can also override the encode_data method, so that it encodes the request body as the API expects. For example, we can encode the data with json.dumps if the API endpoint expects a JSON-encoded body.
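For a JSON-speaking API the override could look like the sketch below. In the real driver this method would sit on the Connection subclass itself (e.g. NephoscaleConnection) and Libcloud's base class would call it before sending the request; the stand-alone class here is only for illustration.

```python
import json

class JsonEncodingConnectionSketch(object):
    """Illustrative stand-in for a Connection subclass that JSON-encodes
    the request body before it is sent over the wire."""

    def encode_data(self, data):
        # Libcloud calls encode_data with the request body; we return
        # the serialized form the API endpoint expects.
        return json.dumps(data)
```

A matching Content-Type header (application/json) would typically be set in add_default_headers alongside this.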

The base Connection class that we override in our driver lives in libcloud/common/base.py.

Writing the Response class

All drivers also need a Response class that handles the responses of the API endpoint, parses the body and either raises exceptions or returns the actual content. Response classes for Libcloud drivers derive from JsonResponse or XmlResponse, depending on the response type of the cloud provider (JSON or XML). Both derive from the Response class that lives in libcloud/common/base.py. We want to make sure that in case of errors we raise the correct exceptions, e.g. InvalidCredsError for failed authentication.

class NephoscaleResponse(JsonResponse):
    """Nephoscale API Response"""

    def parse_error(self):
        if self.status == httplib.UNAUTHORIZED:
            raise InvalidCredsError('Authorization Failed')
        if self.status == httplib.NOT_FOUND:
            raise Exception("The resource you are looking for is not found.")

        return self.body

Notes on create_node and deploy_node

Most Libcloud drivers implement create_node to return a Node object for the newly created host. deploy_node is not overridden, but is instead inherited from NodeDriver, since it does a lot of things that need not be rewritten: it checks that the node's public IP is up and that the machine is accessible, authenticates via password or SSH key, and optionally runs a deploy script.

The default behaviour of create_node is to send the request for the node to be created and get a response with the created node's ID. Most cloud providers we have used reply with the ID of the created node, along with information about its state, public IP etc. NephoScale does not respond with an ID when it receives a node creation request, but instead replies with a status ID that is not very helpful. So as soon as we send the request and receive a valid response, we wait briefly and poll list_nodes, checking whether the machine has appeared. Once we have the Node, we return it from create_node. This way the behaviour of create_node stays consistent with the other drivers, and deploy_node works as expected, out of the box.
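The waiting logic described above can be sketched as a small polling helper. This is illustrative only: the timeout, interval and the exception raised are assumptions, and the real driver would call its own list_nodes rather than take it as a parameter.

```python
import time

def wait_for_node(list_nodes, name, timeout=300, interval=5,
                  clock=time.time, sleep=time.sleep):
    """Poll list_nodes() until a node with the given name appears.

    After the create request is accepted, keep listing the nodes until
    the new one shows up, then return it so create_node can hand back a
    proper Node object. clock/sleep are injectable for testing.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        for node in list_nodes():
            if node.name == name:
                return node
        sleep(interval)
    raise Exception('Node %s did not appear within %s seconds'
                    % (name, timeout))
```

create_node would then send the creation request, call a helper like this, and return the resulting Node, keeping its contract identical to the other drivers.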

Once again, reading the comments on deploy_node and create_node in NodeDriver (libcloud/compute/base.py) will help us implement create_node and make sure deploy_node works.

Keep consistency, hard code things when necessary

One of the cool things with Libcloud is that it standardizes things. It enforces the use of basic entities (nodes, sizes, images, locations) for all cloud providers it supports.

SoftLayer, for example, does not provide a size entity through its API, but rather expects us to specify CPU, RAM and disk size when we create a node. The way its Libcloud driver is implemented, one can either pass these manually or simply select one of the sizes provided by list_sizes.

Ideally, all size/image/location related data should be fetched by asking the provider. When this is not possible, for example when the provider does not implement a relevant request, we need to hard code these settings. For example, the pricing info for some providers is not returned when asking for the sizes (list_sizes) and is thus hard coded in libcloud/data/pricing.json, and the size info for AWS is not available through an API call. The obvious disadvantage of hard coding things is that they have to be maintained to reflect the current status of the provider's API.
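As a sketch of how such hard-coded data gets merged in, a driver can keep a price table keyed by size ID and look it up while building sizes. The size IDs and prices below are made up; Libcloud itself centralizes the real figures in libcloud/data/pricing.json.

```python
# Hypothetical hard-coded price table, keyed by size id. In Libcloud the
# real data lives in libcloud/data/pricing.json and must be kept in sync
# with the provider's published pricing.
PRICING = {
    'small': 0.10,  # made-up USD/hour figures
    'large': 0.40,
}

def price_for(size_id):
    """Return the hard-coded hourly price for a size, or None when the
    table has no entry, mirroring how drivers fill in NodeSize.price."""
    return PRICING.get(size_id)
```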

Working with our driver

While writing the driver, it is always good to test it:

user@user:~/dev/libcloud$ python
>>> from libcloud.compute.types import Provider
>>> from libcloud.compute.providers import get_driver
>>> driver = get_driver(Provider.NEPHOSCALE)
>>> conn = driver('user','correct_password')
>>> conn.list_nodes()
[<Node: uuid=e20bdbf7ef6890645f5b217e0bd2b5912b969cc1,
name=nepho-7, state=0, public_ips=['198.89.109.116'],
provider=NephoScale ...>]

We can use the connection class for GET or POST requests, according to the API, for testing and debugging. E.g. to implement list_locations for NephoScale, we need to see what the response is when requesting the locations (via https://api.nephoscale.com/datacenter/zone/):

>>> conn.connection.request('/datacenter/zone/').object
{'success': True,
'total_count': 2,
'subcode': 0,
'message': 'Your request was processed successfully.',
'data': [
    {'datacenter': {
        'id': 1,
        'airport_code': 'SJC',
        'name': 'SJC-1',
        'uri': 'https://api.nephoscale.com/datacenter/1/'},
    'uri': 'https://api.nephoscale.com/datacenter/zone/86945/',
    'name': 'SJC-1',
    'id': 86945},
    {'datacenter': {
        'id': 3,
        'airport_code': 'RIC',
        'name': 'RIC-1',
        'uri': 'https://api.nephoscale.com/datacenter/3/'},
    'uri': 'https://api.nephoscale.com/datacenter/zone/87729/',
    'name': 'RIC-1',
    'id': 87729}],
'response': 200}
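Given that payload, list_locations reduces to mapping each entry of data onto a NodeLocation. The stand-in class below is illustrative (the real NodeLocation lives in libcloud/compute/base.py), and the country value is an assumption since the payload does not include one:

```python
from collections import namedtuple

# Minimal stand-in for libcloud.compute.base.NodeLocation.
NodeLocation = namedtuple('NodeLocation', 'id name country driver')

def parse_locations(response, driver=None):
    """Map the /datacenter/zone/ payload to NodeLocation objects."""
    return [
        NodeLocation(id=zone['id'], name=zone['name'],
                     country='US',  # assumption: not present in the payload
                     driver=driver)
        for zone in response['data']
    ]
```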

Contributing to Libcloud

Having created the compute driver for our cloud provider and having tested its functionality, it is time to contribute it to Libcloud. For this we need to write tests, ideally for all functionality, add some fixtures and make sure the tests pass. Libcloud contains unit tests for nearly everything and requires tests for new functionality. Testing in Libcloud deserves a post of its own.

Until then, check out the details on how to contribute on the official project website. The code will be reviewed, and most probably comments and suggestions will arise.

So have fun developing your driver!

Special thanks to Tomaz Muraus, maintainer of Libcloud and its biggest labourer of love, for proofreading this post.