  1. Aug 27, 2019
    • Constrain the size of Docker logs · 403cc371
      Doug Szumski authored
      Even though Kolla services are configured to log to file rather than to
      stdout, some output still reaches stdout, for example when a container
      (re)starts. Since the Docker logs are not constrained in size, they can
      fill up the Docker volumes drive and bring down the host. One
      particularly problematic example is when Fluentd cannot parse a log
      message: the warning output is written to the Docker log, and in
      production we have seen it eat 100GB of disk space in less than a day.
      We could configure Fluentd not to do this, but the problem may still
      occur via another mechanism.
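      As an illustration of the kind of cap involved (not necessarily how this
      change implements it, and with made-up values), Docker's json-file log
      driver can be limited via daemon.json; the sketch below just renders
      such a fragment:

          # Sketch only: render a daemon.json fragment that caps json-file logs.
          # The max-size/max-file values are illustrative assumptions, not the
          # defaults introduced by this change.
          import json

          docker_daemon_config = {
              "log-driver": "json-file",
              "log-opts": {
                  "max-size": "50m",   # rotate a container's log file at 50 MB
                  "max-file": "5",     # keep at most five rotated files
              },
          }

          print(json.dumps(docker_daemon_config, indent=4))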
      
      Change-Id: Ia6d3935263a5909c71750b34eb69e72e6e558b7a
      Closes-Bug: #1794249
      (cherry picked from commit bd54b991)
      6.2.2
      403cc371
  2. Aug 15, 2019
  3. Aug 07, 2019
  4. Aug 05, 2019
  5. Aug 03, 2019
  6. Aug 02, 2019
  7. Jul 19, 2019
    • openvswitch: always run handler to ensure OVS bridges are up · da171abc
      David Rabel authored
      When editing external bridge configuration and running a reconfigure
      on openvswitch, handler "Ensuring OVS bridge is properly setup"
      needs to run, but doesn't.
      
      This moves the task from the handlers to its own file and always
      includes it after running the handlers.
      
      Change-Id: Iee39cf00b743ab0776354749c6e162814b5584d8
      Closes-Bug: #1794504
      (cherry picked from commit 8736817a)
      da171abc
  8. Jul 17, 2019
  9. Jul 16, 2019
  10. Jul 12, 2019
    • Add Region and Multiples into default globals.yml · 849d613c
      Raimund Hook authored
      Currently, the documentation around configuring regions directs
      you to make changes to openstack_region_name and multiple_regions_names
      in the globals.yml file.
      The defaults for these variables were not present in that file, which
      could cause confusion. This change adds them with a brief description.
      
      TrivialFix
      
      Change-Id: Ie0ff7e3dfb9a9355a9c9dbaf27151d90162806dd
      (cherry picked from commit e72c49ed)
      849d613c
    • During deploy, always sync DB · 8e219a91
      Mark Goddard authored
      A common class of problems goes like this:
      
      * kolla-ansible deploy
      * Hit a problem, often in ansible/roles/*/tasks/bootstrap.yml
      * Re-run kolla-ansible deploy
      * Service fails to start
      
      This happens because the DB is created during the first run, but for some
      reason we fail before performing the DB sync. This means that on the second run
      we don't include ansible/roles/*/tasks/bootstrap_service.yml because the DB
      already exists, and therefore still don't perform the DB sync. However this
      time, the command may complete without apparent error.
      
      We should be less careful about when we perform the DB sync, and do it whenever
      it is necessary. There is an argument for not doing the sync during a
      'reconfigure' command, although we will not change that here.
      
      For now, this change always performs the DB sync during the 'deploy'
      and 'reconfigure' commands.
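      A rough sketch of the behavioural change (illustrative names only; the
      real logic lives in the Ansible roles):

          # Sketch only: why a failed first run used to leave the DB unsynced.
          def should_run_db_sync(action, db_created_this_run, always_sync):
              if always_sync:
                  # New behaviour: deploy/reconfigure always perform the DB sync.
                  return action in ("deploy", "reconfigure")
              # Old behaviour: sync only when the DB was created during this run,
              # so failing between DB creation and sync skipped the sync forever.
              return db_created_this_run

          # Second run after a failed first run: DB already exists, sync still happens.
          assert should_run_db_sync("deploy", db_created_this_run=False, always_sync=True)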
      
      Change-Id: I82d30f3fcf325a3fdff3c59f19a1f88055b566cc
      Closes-Bug: #1823766
      Closes-Bug: #1797814
      (cherry picked from commit d5e5e885...
      8e219a91
    • Language tweaks in multi-region docs for clarity · 65bae5a7
      Raimund Hook authored
      Tweaked some of the language in doc/source/user/multi-regions.rst for
      clarity.
      
      TrivialFix
      
      Change-Id: Icdd8da6886d0e39da5da80c37d14d2688431ba8f
      65bae5a7
  11. Jul 11, 2019
    • Don't rotate keystone fernet keys during deploy · c68cc2c6
      Mark Goddard authored
      When running deploy or reconfigure for Keystone,
      ansible/roles/keystone/tasks/deploy.yml calls init_fernet.yml,
      which runs /usr/bin/fernet-rotate.sh, which calls keystone-manage
      fernet_rotate.
      
      This means that a token can become invalid if the operator runs
      deploy or reconfigure too often.
      
      This change splits out fernet-push.sh from the fernet-rotate.sh
      script, then calls fernet-push.sh after the fernet bootstrap
      performed in deploy.
      
      Change-Id: I824857ddfb1dd026f93994a4ac8db8f80e64072e
      Closes-Bug: #1833729
      (cherry picked from commit 09e29d0d)
      c68cc2c6
  12. Jul 09, 2019
    • Wait for all compute services before cell discovery · aa442450
      Mark Goddard authored
      There is a race condition during nova deploy since we wait for at least
      one compute service to register itself before performing cells v2 host
      discovery.  It's quite possible that other compute nodes will not yet
      have registered and will therefore not be discovered. This leaves them
      not mapped into a cell, and results in the following error if the
      scheduler picks one when booting an instance:
      
      Host 'xyz' is not mapped to any cell
      
      The problem has been exacerbated by merging a fix [1][2] for a nova race
      condition, which disabled the dynamic periodic discovery mechanism in
      the nova scheduler.
      
      This change fixes the issue by waiting for all expected compute services
      to register themselves before performing host discovery. This includes
      both virtualised compute services and bare metal compute services.
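      A rough sketch of the waiting logic (illustrative only;
      list_registered_compute_hosts is a hypothetical stand-in for querying
      the Nova API, not the role's actual task):

          import time

          def wait_for_compute_services(expected_hosts, list_registered_compute_hosts,
                                        retries=20, delay=3):
              expected = set(expected_hosts)
              registered = set()
              for _ in range(retries):
                  registered = set(list_registered_compute_hosts())
                  if expected <= registered:
                      return True  # every expected host registered; safe to discover
                  time.sleep(delay)
              raise RuntimeError("compute services missing: %s"
                                 % sorted(expected - registered))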
      
      This patch also includes change I58f8fd0a6e82cb614e02fef6e5b271af1d1ce9af,
      which was made to fix an issue with the original version of this patch
      running on Ansible<2.8. See bug 1835817 for details.
      
      Change-Id: I2915e2610e5c0b8d67412e7ec77f7575b8fe9921
      Closes-Bug: #1835002
      (cherry picked from commit c38dd767)
      aa442450
  13. Jul 08, 2019
    • Fixes for MariaDB bootstrap and recovery · 9ec00c24
      Mark Goddard authored
      * Fix wsrep sequence number detection. The log message format is
        'WSREP: Recovered position: <UUID>:<seqno>', but we were picking out
        the UUID rather than the sequence number, which is as good as random
        (see the parsing sketch after this list).
      
      * Add become: true to log file reading and removal since
        I4a5ebcedaccb9261dbc958ec67e8077d7980e496 added become: true to the
        'docker cp' command which creates it.
      
      * Don't run handlers during recovery. If the config files change we
        would end up restarting the cluster twice.
      
      * Wait for wsrep recovery container completion (don't detach). This
        avoids a potential race between wsrep recovery and the subsequent
        'stop_container'.
      
      * Finally, we now wait for the bootstrap host to report that it is in
        an OPERATIONAL state. Without this we can see errors where the
        MariaDB cluster is not ready when used by other services.
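      A minimal sketch of the seqno parsing described in the first bullet
      (the real fix adjusts an Ansible regex; this just shows which field
      matters):

          import re

          def wsrep_seqno(log_line):
              # 'WSREP: Recovered position: <UUID>:<seqno>' - the seqno is the
              # number after the last colon; taking the UUID instead is useless.
              match = re.search(r"WSREP: Recovered position:\s*\S+:(-?\d+)", log_line)
              if match is None:
                  raise ValueError("no recovered position in line")
              return int(match.group(1))

          assert wsrep_seqno(
              "WSREP: Recovered position: 37a5ecfd-1f0f-11e9-ab45-0242ac110002:123") == 123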
      
      Change-Id: Iaf7862be1affab390f811fc485fd0eb6879fd583
      Closes-Bug: #1834467
      (cherry picked from commit 86f373a1)
      9ec00c24
  14. Jul 04, 2019
    • Restart all nova services after upgrade · 621a4d6f
      Mark Goddard authored
      During an upgrade, nova pins the version of RPC calls to the minimum
      seen across all services. This ensures that old services do not receive
      data they cannot handle. After the upgrade is complete, all nova
      services are supposed to be reloaded via SIGHUP to cause them to
      re-check the RPC versions of services and use the latest version, which
      should now be supported by all running services.
      
      Due to a bug [1] in oslo.service, sending services SIGHUP is currently
      broken. We replaced the HUP with a restart for the nova_compute
      container for bug 1821362, but not other nova services. It seems we need
      to restart all nova services to allow the RPC version pin to be removed.
      
      Testing in a Queens to Rocky upgrade, we find the following in the logs:
      
      Automatically selected compute RPC version 5.0 from minimum service
      version 30
      
      However, the service version in Rocky is 35.
      
      There is a second issue in that it takes some time for the upgraded
      services to update the nova services database table with their new
      version. We need to wait until all nova-compute services have done this
      before the restart is performed, otherwise the RPC version cap will
      remain in place. There is currently no interface in nova available for
      checking these versions [2], so as a workaround we use a configurable
      delay with a default duration of 30 seconds. Testing showed it takes
      about 10 seconds for the version to be updated, so this gives us some
      headroom.
      
      This change restarts all nova services after an upgrade, after a 30
      second delay.
      
      [1] https://bugs.launchpad.net/oslo.service/+bug/1715374
      [2] https://bugs.launchpad.net/nova/+bug/1833542
      
      Change-Id: Ia6fc9011ee6f5461f40a1307b72709d769814a79
      Closes-Bug: #1833069
      Related-Bug: #1833542
      (cherry picked from commit e6d2b922)
      621a4d6f
  15. Jun 30, 2019
  16. Jun 28, 2019
    • Avoid parallel discover_hosts (nova-related race condition) · 73747860
      Radosław Piliszek authored
      
      In a rare event both kolla-ansible and nova-scheduler try to do
      the mapping at the same time and one of them fails.
      Since kolla-ansible runs host discovery on each deployment,
      there is no need to change the default of no periodic host discovery.
      
      I added some notes for the future; they are not critical. I made the
      decision explicit in the comments, changed the task name to satisfy
      recommendations, and removed the unused variable (to avoid future
      doubts).
      
      Closes-Bug: #1832987
      Change-Id: I3128472f028a2dbd7ace02abc179a9629ad74ceb
      Signed-off-by: Radosław Piliszek <radoslaw.piliszek@gmail.com>
      (cherry picked from commit ce680bcf)
      (cherry picked from commit ed3d1e37)
      73747860
  17. Jun 27, 2019
  18. Jun 24, 2019
  19. Jun 19, 2019
  20. Jun 18, 2019
    • Add blazar to fluentd aggregation · 01654d83
      Cody Hammock authored
      If Blazar is enabled, ensure that fluentd processes its logs.
      
      Change-Id: If71d5c056c042667388dae8e4ee6d51a5ecab46e
      (cherry picked from commit 2c343562)
      01654d83
    • [heat] Multi-region support for bootstrap · 22c012c9
      Jason authored
      When bootstrapping, Heat was not setting a region explicitly, so it
      could default to a region other than the one being deployed.
      
      Change-Id: I0a0596a020fbff91ccc5b9f44f271eab220c88cd
      (cherry picked from commit 44da1963)
      22c012c9
    • Fix Blazar Nova aggregate in multi-region setup · 489a0a31
      Jason authored
      When registering the Nova aggregate for Blazar, the aggregate was always
      defaulting to some region (usually the first in the Keystone endpoint
      list). Add a region override to ensure we are always writing to the
      region being deployed.
      
      Change-Id: I3f921ac51acab1b1020a459c07c755af7023e026
      (cherry picked from commit f20cbf49)
      489a0a31
    • Hide logs when looping over passwords · 554187c9
      Jason authored
      When Ansible runs a loop, by default it prints all the keys of the item
      it is looping over. Some roles, when setting up their databases,
      iterate over an object that includes the database password.
      
      Override the loop label to hide everything but the database name.
      
      Change-Id: I336a81a5ecd824ace7d40e9a35942a1c853554cd
      (cherry picked from commit 30c619d1)
      554187c9
    • Support multi-region discovery of Nova cells · 18d8960e
      Jason authored
      In a multi-region environment, each region is being deployed separately.
      Cell discovery, however, would sometimes fail because it picked a region
      different from the one being deployed. Most likely, an internal endpoint
      for region A will not be visible from region B. Furthermore, it is not
      very useful to discover hosts on a region you're not modifying.
      
      This changes the check to only run against nova compute services located
      in the region being deployed.
      
      Change-Id: I21eb1164c2f67098b81edbd5cc106472663b92cb
      (cherry picked from commit 328e1425)
      18d8960e
    • Add ansible_nodename (system hostname) to /etc/hosts · d6bac914
      Pierre Riteau authored
      Kolla-Ansible populates /etc/hosts with overcloud hosts using their API
      interface IP address. When configured correctly, this allows Nova to use
      the API interface for live migration of instances between compute hosts.
      
      The hostname used is from the `ansible_hostname` variable, which is a
      short hostname generated by Ansible using the first dot as a delimiter.
      However, Nova defaults to using the result of socket.gethostname() to
      register nova-compute services.
      
      In deployments where hostnames are set to FQDNs, for example when using
      FreeIPA, nova-compute would try to reach the other compute node using
      its FQDN (as registered in the Nova database), which was absent from
      /etc/hosts. This can result in failures to live migrate instances if
      DNS entries don't match.
      
      This commit populates /etc/hosts with `ansible_nodename` (hostname as
      reported by the system) in addition to `ansible_hostname`, if they are
      different.
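      A sketch of the resulting /etc/hosts entry (illustrative; the name
      ordering and the exact template are assumptions, not taken from the
      change itself):

          def etc_hosts_line(api_ip, ansible_hostname, ansible_nodename):
              names = [ansible_hostname]
              if ansible_nodename != ansible_hostname:
                  names.append(ansible_nodename)  # add the FQDN only when it differs
              return "%s %s" % (api_ip, " ".join(names))

          assert (etc_hosts_line("10.0.0.11", "compute1", "compute1.example.com")
                  == "10.0.0.11 compute1 compute1.example.com")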
      
      Change-Id: Id058aa1db8d60c979680e6a41f7f3e1c39f98235
      Closes-Bug: #1830023
      (cherry picked from commit 37899026)
      d6bac914
    • nova: Fix DBNotAllowed during compute startup · 11b9eba2
      Mark Goddard authored
      backport: stein, rocky
      
      During startup of nova-compute, we see the following error message:
      
      Error gathering result from cell 00000000-0000-0000-0000-000000000000:
      DBNotAllowed: nova-compute
      
      This issue was observed in devstack [1], and fixed [2] by removing
      database configuration from the compute service.
      
      This change takes the same approach, removing DB config from nova.conf
      in the nova-compute* containers.
      
      [1] https://bugs.launchpad.net/devstack/+bug/1812398
      [2] https://opendev.org/openstack/devstack/commit/82537871376afe98a286e1ba424cf192ae60869a
      
      Change-Id: I18c99ff4213ce456868e64eab63a4257910b9b8e
      Closes-Bug: #1829705
      (cherry picked from commit 002eec95)
      11b9eba2
    • Fix keystone fernet key rotation scheduling · d5cef35a
      Mark Goddard authored
      Right now every controller rotates fernet keys. This is nice because
      should any controller die, we know the remaining ones will rotate the
      keys. However, we are currently over-rotating the keys.
      
      When we over-rotate keys, we get logs like this:
      
       This is not a recognized Fernet token <token> TokenNotFound
      
      Most clients can recover and get a new token, but some clients (like
      Nova passing tokens to other services) can't do that because they don't
      have the password to regenerate a new token.
      
      With three controllers, the keystone-fernet crontabs show the once-a-day
      rotation correctly staggered across the three controllers:
      
      ssh ctrl1 sudo cat /etc/kolla/keystone-fernet/crontab
      0 0 * * * /usr/bin/fernet-rotate.sh
      ssh ctrl2 sudo cat /etc/kolla/keystone-fernet/crontab
      0 8 * * * /usr/bin/fernet-rotate.sh
      ssh ctrl3 sudo cat /etc/kolla/keystone-fernet/crontab
      0 16 * * * /usr/bin/fernet-rotate.sh
      
      Currently with three controllers we have this keystone config:
      
      [token]
      expiration = 86400 (although the keystone default is one hour)
      allow_expired_window = 172800 (this is the keystone default)
      
      [fernet_tokens]
      max_active_keys = 4
      
      Currently, kolla-ansible configures key rotation according to the following:
      
         rotation_interval = token_expiration / num_hosts
      
      This means we rotate keys more quickly the more hosts we have, which doesn't
      make much sense.
      
      Keystone docs state:
      
         max_active_keys =
           ((token_expiration + allow_expired_window) / rotation_interval) + 2
      
      For details see:
      https://docs.openstack.org/keystone/stein/admin/fernet-token-faq.html
      
      Rotation is based on pushing out a staging key, so that should any
      server start using that key, the other servers will consider it valid.
      Then each server in turn starts using the staging key, demoting the
      existing primary key to a secondary key as it does so. Eventually the
      secondary keys are pruned once there is no token in the wild that would
      need to be decrypted using them. So this all makes sense.
      
      This change adds new variables for fernet_token_allow_expired_window and
      fernet_key_rotation_interval, so that we can calculate the correct
      number of active keys. We now set the default rotation interval so as to
      minimise the number of active keys to 3 - one primary, one secondary,
      one buffer.
      
      This change also fixes the fernet cron job generator, which was broken
      in the following cases:
      
      * requesting an interval of more than 1 day resulted in no jobs
      * requesting an interval of more than 60 minutes, unless an exact
        multiple of 60 minutes, resulted in no jobs
      
      It should now be possible to request any interval up to a week divided
      by the number of hosts.
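      For illustration only (this is not the project's actual generator), one
      way to lay out such intervals over a week of crontab entries, staggered
      per host:

          # Sketch: emit one cron entry per rotation within a week, staggering
          # hosts by an equal share of the interval, so intervals longer than
          # an hour or a day still produce jobs.
          def fernet_cron_entries(interval_minutes, host_index, num_hosts,
                                  command="/usr/bin/fernet-rotate.sh"):
              week = 7 * 24 * 60
              offset = (interval_minutes // num_hosts) * host_index
              entries, t = [], offset
              while t < week:
                  minute, hour, weekday = t % 60, (t // 60) % 24, t // (24 * 60)
                  entries.append("%d %d * * %d %s" % (minute, hour, weekday, command))
                  t += interval_minutes
              return entries

          # Three controllers rotating daily are staggered 8 hours apart
          # (00:00, 08:00, 16:00) like the crontabs shown earlier, though this
          # sketch enumerates each weekday explicitly.
          assert fernet_cron_entries(1440, 1, 3)[0] == "0 8 * * 0 /usr/bin/fernet-rotate.sh"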
      
      Change-Id: I10c82dc5f83653beb60ddb86d558c5602153341a
      Closes-Bug: #1809469
      (cherry picked from commit 6c1442c3)
      d5cef35a
    • Add unit test for keystone fernet cron generator · ec2aa48c
      Mark Goddard authored
      Before making changes to this script, document its behaviour with a unit
      test.
      
      There are two major issues:
      
      * requesting an interval of more than 1 day results in no jobs
      * requesting an interval of more than 60 minutes, unless an exact
        multiple of 60 minutes, results in no jobs
      
      Change-Id: I655da1102dfb4ca12437b7db0b79c9a61568f79e
      Related-Bug: #1809469
      (cherry picked from commit 25ac955a)
      ec2aa48c
  21. Jun 13, 2019
    • Stop duplicating Nova cells · f6cfee32
      Pierre Riteau authored
      Check if a base Nova cell already exists before calling `nova-manage
      cell_v2 create_cell`, which would otherwise create a duplicate cell when
      the transport URL or database connection changes.
      
      If a base cell already exists but the connection values have changed, we
      now call `nova-manage cell_v2 update_cell` instead. This is only
      possible if a duplicate cell has not yet been created. If one already
      exists, we print a warning inviting the operator to perform a manual
      cleanup. We don't use a hard fail to avoid an abrupt change of behavior
      if this is backported to stable branches.
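      A minimal sketch of the decision logic described above, with the
      existing cells passed in as plain data (the real change drives
      nova-manage from Ansible; the dictionary keys here are assumptions):

          def cell_action(existing_cells, transport_url, db_connection):
              # existing_cells: list of dicts describing already-registered
              # (non-cell0) cells, each with 'transport_url' and
              # 'database_connection' keys.
              if not existing_cells:
                  return "create_cell"
              if len(existing_cells) > 1:
                  return "warn: duplicate cells found, manual cleanup required"
              cell = existing_cells[0]
              if (cell["transport_url"] != transport_url
                      or cell["database_connection"] != db_connection):
                  return "update_cell"
              return "noop"

          assert cell_action([], "rabbit://new", "mysql://new") == "create_cell"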
      
      Change-Id: I7841ce0cff08e315fd7761d84e1e681b1a00d43e
      Closes-Bug: #1734872
      (cherry picked from commit 19b8dbe4)
      f6cfee32
  22. May 31, 2019