Debugging Ansible modules

I worked on Ansible automation in the past years, and one of the few cons using this tool is the complexity of a deep debug operation. Variables assignment can be easily spotted, but when Python modules are involved, finding the clue can be more challenging.

How it started

Few hours ago I was installing a CEPH cluster using the ceph-ansible tool. I’m used to it, but it was my first time installing the latest version of CEPH (codename Octopus), so I was reviewing a Nautilus well defined playbook configuration to reuse it.

I launched the playbook, solved some configuration issues, and re-ran the playbook. At last I saw a strange error when a task failed.

fatal: [<my_hostname>]: FAILED! => changed=true 
  cmd:
  - ceph
  - --cluster
  - ceph
  - dashboard
  - ac-user-set-roles
  - VALUE_SPECIFIED_IN_NO_LOG_PARAMETER
  - '********istrator'
  delta: '0:00:04.784677'
  end: '2020-11-17 18:30:18.164192'
  invocation:
    module_args:
      cluster: ceph
      name: VALUE_SPECIFIED_IN_NO_LOG_PARAMETER
      password: VALUE_SPECIFIED_IN_NO_LOG_PARAMETER
      roles:
      - '********istrator'
      state: present
  rc: 2
  start: '2020-11-17 18:30:13.379515'
  stderr: 'Error ENOENT: User ''********'' does not exist'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>

At this point, as usual, I enabled a more verbose output level adding -vvv to the command line, but nothing useful appeared.

Ok, now it’s time to further investigate. This is a fragment of the tasks in roles/ceph-dashboard/tasks/configure_dashboard.yml:

- name: create dashboard admin user
  ceph_dashboard_user:
    name: "{{ dashboard_admin_user }}"
    cluster: "{{ cluster }}"
    password: "{{ dashboard_admin_password }}"
    roles: ["{{ 'read-only' if dashboard_admin_user_ro | bool else 'administrator' }}"]
  run_once: true
  delegate_to: "{{ groups[mon_group_name][0] }}"
  environment:
    CEPH_CONTAINER_IMAGE: "{{ ceph_docker_registry + '/' + ceph_docker_image + ':' + ceph_docker_image_tag if containerized_deployment | bool else None }}"
    CEPH_CONTAINER_BINARY: "{{ container_binary }}"

It’s clear that the task is trying to create a new user for the CEPH dashboard with the related password and assigned role (administrator). Now, I wanted to check if all the variables are properly set at runtime. So I added the following at the end of the task:

  debugger: on_failed

This way when the task fails, a debugger prompt will be availble, like:

Debugger invoked
[<my_hostname>] TASK: ceph-dashboard : create dashboard admin user (debug)> p task.args

As you can see, I typed p task.args, as a result, all the task vars has been shown. I double checked everything, but nothing was wrong.

So I tried to deeply investigate. If you recall the task definition, you can find a Python module reference:

- name: create dashboard admin user
  ceph_dashboard_user:

Now, grepping for the module name returned the Python code file path:

# grep -r 'ceph_dashboard_user:' | grep '.py'
library/ceph_dashboard_user.py:  ceph_dashboard_user:
library/ceph_dashboard_user.py:  ceph_dashboard_user:
library/ceph_dashboard_user.py:  ceph_dashboard_user:
library/ceph_dashboard_user.py:  ceph_dashboard_user:
library/ceph_dashboard_user.py:  ceph_dashboard_user:

Ok, let’s see what the code can tell us. Even if you are not a Python expert (as I’m not), you can easily find the related code portion:

def run_module():
  ...
    if state == "present":
        rc, cmd, out, err = exec_commands(module, get_user(module, container_image=container_image))
        if rc == 0:
            user = json.loads(out)
            user['roles'].sort()
            roles.sort()
            if user['roles'] != roles:
                rc, cmd, out, err = exec_commands(module, set_roles(module, container_image=container_image))
                changed = True
            rc, cmd, out, err = exec_commands(module, set_password(module, container_image=container_image))
        else:
            rc, cmd, out, err = exec_commands(module, create_user(module, container_image=container_image))
            rc, cmd, out, err = exec_commands(module, set_roles(module, container_image=container_image))
            changed = True

    elif state == "absent":
        rc, cmd, out, err = exec_commands(module, get_user(module, container_image=container_image))
        if rc == 0:
            rc, cmd, out, err = exec_commands(module, remove_user(module, container_image=container_image))
            changed = True
        else:
            rc = 0
            out = "Dashboard User {} doesn't exist".format(name)

    elif state == "info":
        rc, cmd, out, err = exec_commands(module, get_user(module, container_image=container_image))

    exit_module(module=module, out=out, rc=rc, cmd=cmd, err=err, startd=startd, changed=changed)

Basically, if the specified user exists, the code assigns the wanted role(s) to it. If the user does not exist, it will be created before the role assignment. In our case the user needs to be created, so it’s only a matter of time before you spot the problem:

 else:
     rc, cmd, out, err = exec_commands(module, create_user(module, container_image=container_image))
     rc, cmd, out, err = exec_commands(module, set_roles(module, container_image=container_image))
     changed = True

And this is the problem: two function calls and no return code check between them. Probably, the first one fails, and the second fails because no user has been created at previous step. This perfectly fit the error Error ENOENT: User ''********'' does not exist.

But the point is “why is the first statement failing?”. OK, now the dirty part comes. How can we debug the Python code? The answer can be obvious: with a Python debugger. But how to use it in this particular case?

Ok, we start from which is and where is the code? If you ran your playbooks using -vvv before, you probably saw something like:

TASK [ceph-dashboard : create dashboard admin user] ******************************************************************************************************************************************************************************************
task path: /root/ceph-ansible/roles/ceph-dashboard/tasks/configure_dashboard.yml:132
   ...
<hostname> PUT /root/.ansible/tmp/ansible-local-28799ktuew6ct/tmpmdh5z9v3 TO /home/ansible/.ansible/tmp/ansible-tmp-1605637811.2138333-38572-176257464667380/AnsiballZ_ceph_dashboard_user.py

This is showing exactly where the code is being copied on the target host(s). But Ansible is like a well trained batallion, occupies, wins and finally leaves, so injected code is usually removed after execution. To make the code files persist we can set the env var ANSIBLE_KEEP_REMOTE_FILES to 1. We can do it when the playbook is launched:

ANSIBLE_KEEP_REMOTE_FILES=1 ansible-playbook site.yml -vvvv

Ok, now wait for the playbook to fail, stop the execution and SSH into the target host. Go to the dir we found earlier, and have a look:

# cd /home/ansible/.ansible/tmp/ansible-tmp-1605637811.2138333-38572-176257464667380/

# ls
AnsiballZ_ceph_dashboard_user.py

Ok, now this Python scripts contains all the code to be executed for the related task. To obtain the code in a readable format, we must extract the content from it:

python3 AnsiballZ_ceph_dashboard_user.py explode

A dir named debug_dir/ has been created, it contains all the code:

debug_dir/
├── ansible
│   ├── __init__.py
│   ├── module_utils
│   │   ├── __init__.py
│   │   ├── _text.py
│   │   ├── basic.py
│   │   ├── common
│   │   │   ├── __init__.py
│   │   │   ├── _collections_compat.py
│   │   │   ├── _json_compat.py
│   │   │   ├── _utils.py
│   │   │   ├── collections.py
│   │   │   ├── file.py
│   │   │   ├── parameters.py
│   │   │   ├── process.py
│   │   │   ├── sys_info.py
│   │   │   ├── text
│   │   │   │   ├── __init__.py
│   │   │   │   ├── converters.py
│   │   │   │   └── formatters.py
│   │   │   └── validation.py
│   │   ├── compat
│   │   │   ├── __init__.py
│   │   │   ├── _selectors2.py
│   │   │   └── selectors.py
│   │   ├── distro
│   │   │   ├── __init__.py
│   │   │   └── _distro.py
│   │   ├── parsing
│   │   │   ├── __init__.py
│   │   │   └── convert_bool.py
│   │   ├── pycompat24.py
│   │   └── six
│   │       └── __init__.py
│   └── modules
│       ├── __init__.py
│       └── ceph_dashboard_user.py
└── args

A file containing the arguments passed to the module, called args, will be also available inside it.

We are interested in debugging the file we previously read, ceph_dashboard_user.py. So edit the file copy under debug_dir/ansible/modules/ceph_dashboard_user.py. Now we need to find again the two lines we found previously and insert a breakpoint just before. It simply implies inserting a line as follows:

else:
    import pdb; pdb.set_trace()
    rc, cmd, out, err = exec_commands(module, create_user(module, container_image=container_image))
    rc, cmd, out, err = exec_commands(module, set_roles(module, container_image=container_image))

To debug the code we’ll use pdb, so ensure it is properly installed in your system. Launch the extracted and modified code with:

python3 AnsiballZ_ceph_dashboard_user.py execute

pdb will reach the breakpoint and wait for further instructions:

-> rc, cmd, out, err = exec_commands(module, create_user(module, container_image=container_image))
(Pdb)

Now we should tell pdb to execute the next step typing n (next):

(Pdb) n
-> rc, cmd, out, err = exec_commands(module, set_roles(module, container_image=container_image))

At this point point the debugger has executed the first statement, and is about to execute the second one, and again is waiting for more input. Ok, that’s the hot moment: was the just executed line successful? Only few variables can give us this information. We’ll use p (print) to see variables content, look at rc (it stands for return code, I suppose):

(Pdb) p rc
2

Oh, but the code if different from 0, so it is failing, as we guessed earlier. OK, but why? Now check the error message contained in err variable:

(Pdb) p err
'Error EINVAL: Password is too weak.\n'

Got it! My bad, I used a trivial password in my playbook config, and user creation failed for this reason. It’s obviously my fault, but the code is merely lacking of good practice while not checking the return code between the to calls.

I changed the password using a stronger one, re-ran the playbook and everything went ok.

That’s all, now you know how to debug an Ansible module.

Enjoy.

Debugging Ansible modules

How it started

loudgefly