- Aug 27, 2019
Doug Szumski authored
Even though Kolla services are configured to log to files rather than stdout, some output still reaches stdout, for example when a container (re)starts. Since the Docker logs are not constrained in size, they can fill up the drive holding the Docker volumes and bring down the host. One particularly problematic case is when Fluentd cannot parse a log message: the warning output is written to the Docker log, and in production we have seen it consume 100 GB of disk space in less than a day. We could configure Fluentd not to do this, but the problem may still occur via another mechanism.

Change-Id: Ia6d3935263a5909c71750b34eb69e72e6e558b7a
Closes-Bug: #1794249
(cherry picked from commit bd54b991)
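As a hedged illustration of bounding Docker log growth (not necessarily the exact change), the json-file log driver can be capped via kolla-ansible's docker_custom_config variable, which is rendered into /etc/docker/daemon.json; the size values below are illustrative:

    # globals.yml (sketch): cap the size of per-container Docker logs.
    docker_custom_config:
      log-driver: "json-file"
      log-opts:
        max-size: "50m"   # illustrative value
        max-file: "3"     # illustrative value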
- Aug 15, 2019
Zuul authored
- Aug 07, 2019
Zuul authored
Michal Nasiadka authored
* Sometimes getting/creating the Ceph MDS keyring fails, similar to https://tracker.ceph.com/issues/16255

Change-Id: I47587cbeb8be0e782c13ba7f40367409e2daa8a8
(cherry picked from commit 4e3054b5)
(cherry picked from commit a579e19b)
(cherry picked from commit de34995b)
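A hedged sketch of the retry pattern such a fix typically uses (the task, container and keyring names below are illustrative, not the exact change):

    # Retry the keyring operation a few times to ride out transient monitor errors.
    - name: Getting or creating the Ceph MDS keyring (sketch)
      command: >
        docker exec ceph_mon ceph auth get-or-create mds.{{ inventory_hostname }}
        mon 'allow profile mds' mds 'allow *' osd 'allow rwx'
      register: mds_keyring_result
      retries: 3
      delay: 5
      until: mds_keyring_result is succeeded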
- Aug 05, 2019
Zuul authored
- Aug 03, 2019
- Aug 02, 2019
Zuul authored
- Jul 19, 2019
David Rabel authored
When editing the external bridge configuration and running a reconfigure on openvswitch, the handler "Ensuring OVS bridge is properly setup" needs to run, but doesn't. This moves the task from the handlers into its own file and always includes it after running the handlers.

Change-Id: Iee39cf00b743ab0776354749c6e162814b5584d8
Closes-Bug: #1794504
(cherry picked from commit 8736817a)
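A minimal sketch of the resulting pattern, assuming the bridge setup task lives in its own file (the file name is illustrative):

    # Run any notified handlers first, then always include the bridge setup tasks
    # instead of relying on a handler that may not be triggered.
    - name: Flush handlers
      meta: flush_handlers

    - name: Ensuring OVS bridge is properly setup
      include_tasks: post-config.yml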
- Jul 17, 2019
Zuul authored
- Jul 16, 2019
Zuul authored
- Jul 12, 2019
Raimund Hook authored
Currently, the documentation around configuring regions directs you to make changes to openstack_region_name and multiple_regions_names in the globals.yml file. The defaults weren't represented there, which could potentially cause confusion. This change adds these defaults with a brief description.

TrivialFix
Change-Id: Ie0ff7e3dfb9a9355a9c9dbaf27151d90162806dd
(cherry picked from commit e72c49ed)
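For reference, the documented defaults look like the following globals.yml entries (a sketch; check your release's globals.yml for the exact comments):

    # Region in which this deployment registers its endpoints.
    openstack_region_name: "RegionOne"
    # Names of all regions sharing this Keystone; used for multi-region deployments.
    multiple_regions_names:
      - "{{ openstack_region_name }}"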
Mark Goddard authored
A common class of problems goes like this:

* kolla-ansible deploy
* Hit a problem, often in ansible/roles/*/tasks/bootstrap.yml
* Re-run kolla-ansible deploy
* Service fails to start

This happens because the DB is created during the first run, but for some reason we fail before performing the DB sync. This means that on the second run we don't include ansible/roles/*/tasks/bootstrap_service.yml because the DB already exists, and therefore still don't perform the DB sync. However, this time the command may complete without apparent error. We should be less careful about when we perform the DB sync, and do it whenever it is necessary. There is an argument for not doing the sync during a 'reconfigure' command, although we will not change that here. This change always performs the DB sync during the 'deploy' and 'reconfigure' commands.

Change-Id: I82d30f3fcf325a3fdff3c59f19a1f88055b566cc
Closes-Bug: #1823766
Closes-Bug: #1797814
(cherry picked from commit d5e5e885...
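A hedged sketch of the resulting deploy flow (the condition name is illustrative; the task file names follow the role layout mentioned above):

    # deploy.yml (sketch): create the database only when it is missing, but
    # always run the service bootstrap, whose DB sync is idempotent.
    - include_tasks: bootstrap.yml
      when: database_missing | bool          # illustrative condition

    - include_tasks: bootstrap_service.yml   # now included unconditionally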
Raimund Hook authored
Tweaked some of the language in doc/source/user/multi-regions.rst for clarity.

TrivialFix
Change-Id: Icdd8da6886d0e39da5da80c37d14d2688431ba8f
- Jul 11, 2019
Mark Goddard authored
When running deploy or reconfigure for Keystone, ansible/roles/keystone/tasks/deploy.yml calls init_fernet.yml, which runs /usr/bin/fernet-rotate.sh, which calls keystone-manage fernet_rotate. This means that a token can become invalid if the operator runs deploy or reconfigure too often. This change splits fernet-push.sh out of the fernet-rotate.sh script, then calls fernet-push.sh after the fernet bootstrap performed during deploy.

Change-Id: I824857ddfb1dd026f93994a4ac8db8f80e64072e
Closes-Bug: #1833729
(cherry picked from commit 09e29d0d)
- Jul 09, 2019
Mark Goddard authored
There is a race condition during nova deploy since we wait for at least one compute service to register itself before performing cells v2 host discovery. It is quite possible that other compute nodes will not yet have registered and will therefore not be discovered. This leaves them not mapped into a cell, and results in the following error if the scheduler picks one when booting an instance:

    Host 'xyz' is not mapped to any cell

The problem has been exacerbated by merging a fix [1][2] for a nova race condition, which disabled the dynamic periodic discovery mechanism in the nova scheduler. This change fixes the issue by waiting for all expected compute services to register themselves before performing host discovery. This includes both virtualised compute services and bare metal compute services.

This patch also includes change I58f8fd0a6e82cb614e02fef6e5b271af1d1ce9af, which was made to fix an issue with the original version of this patch running on Ansible<2.8. See bug 1835817 for details.

Change-Id: I2915e2610e5c0b8d67412e7ec77f7575b8fe9921
Closes-Bug: #1835002
(cherry picked from commit c38dd767)
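A hedged sketch of the waiting step described above (the auth setup is omitted and the inventory group name is assumed to be 'compute'):

    # Poll until every expected nova-compute service has registered itself.
    - name: Waiting for all nova-compute services to register
      command: openstack compute service list --service nova-compute -f json
      register: compute_services
      changed_when: false
      run_once: true
      retries: 20
      delay: 10
      until: (compute_services.stdout | from_json | length) >= (groups['compute'] | length)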
- Jul 08, 2019
Mark Goddard authored
* Fix wsrep sequence number detection. The log message format is 'WSREP: Recovered position: <UUID>:<seqno>' but we were picking out the UUID rather than the sequence number. This is as good as random.
* Add become: true to log file reading and removal, since I4a5ebcedaccb9261dbc958ec67e8077d7980e496 added become: true to the 'docker cp' command which creates it.
* Don't run handlers during recovery. If the config files change we would end up restarting the cluster twice.
* Wait for wsrep recovery container completion (don't detach). This avoids a potential race between wsrep recovery and the subsequent 'stop_container'.
* Finally, we now wait for the bootstrap host to report that it is in an OPERATIONAL state. Without this we can see errors where the MariaDB cluster is not ready when used by other services.

Change-Id: Iaf7862be1affab390f811fc485fd0eb6879fd583
Closes-Bug: #1834467
(cherry picked from commit 86f373a1)
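A hedged sketch of picking out the sequence number rather than the UUID (the registered variable name is illustrative):

    # Take the trailing <seqno>, not the UUID, from
    # 'WSREP: Recovered position: <UUID>:<seqno>'.
    - name: Extracting the wsrep sequence number
      set_fact:
        wsrep_seqno: "{{ (wsrep_recovery_log.stdout | regex_search('WSREP: Recovered position:.*')).split(':') | last | trim }}"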
- Jul 04, 2019
Mark Goddard authored
During an upgrade, nova pins the version of RPC calls to the minimum seen across all services. This ensures that old services do not receive data they cannot handle. After the upgrade is complete, all nova services are supposed to be reloaded via SIGHUP to cause them to check the RPC versions of services again and use the new latest version, which should now be supported by all running services.

Due to a bug [1] in oslo.service, sending services SIGHUP is currently broken. We replaced the HUP with a restart for the nova_compute container for bug 1821362, but not for the other nova services. It seems we need to restart all nova services to allow the RPC version pin to be removed. Testing in a Queens to Rocky upgrade, we find the following in the logs:

    Automatically selected compute RPC version 5.0 from minimum service version 30

However, the service version in Rocky is 35.

There is a second issue in that it takes some time for the upgraded services to update the nova services database table with their new version. We need to wait until all nova-compute services have done this before the restart is performed, otherwise the RPC version cap will remain in place. There is currently no interface in nova available for checking these versions [2], so as a workaround we use a configurable delay with a default duration of 30 seconds. Testing showed it takes about 10 seconds for the version to be updated, so this gives us some headroom.

This change restarts all nova services after an upgrade, after a 30 second delay.

[1] https://bugs.launchpad.net/oslo.service/+bug/1715374
[2] https://bugs.launchpad.net/nova/+bug/1833542

Change-Id: Ia6fc9011ee6f5461f40a1307b72709d769814a79
Closes-Bug: #1833069
Related-Bug: #1833542
(cherry picked from commit e6d2b922)
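A hedged sketch of the delay-then-restart step described above (the variable and task file names are illustrative; check the nova role for the real ones):

    # Give upgraded nova-compute services time to record their new service
    # version before restarting everything to lift the RPC version pin.
    - name: Waiting for nova services to update their service version
      pause:
        seconds: "{{ nova_service_version_delay | default(30) }}"   # illustrative variable

    - name: Restarting all nova services to remove the RPC version pin
      include_tasks: restart_services.yml   # illustrative task file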
- Jun 30, 2019
Radosław Piliszek authored
The way Zuul handles job dependencies caused the computed path to setup_ceph_disks.sh to be wrong when the job was run due to a kolla (as opposed to kolla-ansible) change. This patch uses a relative path, as is the case in Stein and later.

Change-Id: I4f45cd18b86ebe0efcc3101e31e4bcafdc83eb57
Signed-off-by: Radosław Piliszek <radoslaw.piliszek@gmail.com>
(cherry picked from commit 1d22714c)
- Jun 28, 2019
Radosław Piliszek authored
In a rare event both kolla-ansible and nova-scheduler try to do the mapping at the same time and one of them fails. Since kolla-ansible runs host discovery on each deployment, there is no need to change the default of no periodic host discovery.

I added some notes for the future; they are not critical. I made the decision explicit in the comments. I changed the task name to satisfy recommendations. I removed the variable because it is not used (to avoid future doubts).

Closes-Bug: #1832987
Change-Id: I3128472f028a2dbd7ace02abc179a9629ad74ceb
Signed-off-by: Radosław Piliszek <radoslaw.piliszek@gmail.com>
(cherry picked from commit ce680bcf)
(cherry picked from commit ed3d1e37)
- Jun 27, 2019
Radosław Piliszek authored
Due to zuul-cloner being finally removed from Zuul v3, we have to remove zuul-cloner usage from supported branches to keep using CI. Stein and up have it removed by an unrelated change. This simple patch is to go to rocky and below. See [1] for the Zuul change.

[1] https://review.opendev.org/663151

Change-Id: Id4ac65148daf09cb192789a1d73871811cdea342
Signed-off-by: Radosław Piliszek <radoslaw.piliszek@gmail.com>
(cherry picked from commit 924226df)
- Jun 24, 2019
- Jun 19, 2019
- Jun 18, 2019
Cody Hammock authored
If Blazar is enabled, ensure that fluentd processes its logs.

Change-Id: If71d5c056c042667388dae8e4ee6d51a5ecab46e
(cherry picked from commit 2c343562)
Jason authored
The Nova aggregate was always defaulting to some region (usually the first in the Keystone endpoint list) when registering the Nova aggregate for Blazar. Add a region override to ensure we always write to the region being deployed.

Change-Id: I3f921ac51acab1b1020a459c07c755af7023e026
(cherry picked from commit f20cbf49)
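A hedged sketch of such a region override (parameters simplified; the aggregate name is illustrative):

    # Pin the aggregate operation to the region currently being deployed.
    - name: Creating the Blazar freepool host aggregate
      os_nova_host_aggregate:
        name: freepool
        auth: "{{ openstack_auth }}"
        region_name: "{{ openstack_region_name }}"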
Jason authored
When Ansible goes into a loop, by default it prints all the keys of the item it is looping over. Some roles, when setting up the databases, iterate over an object that includes the database password. Override the loop label to hide everything but the database name.

Change-Id: I336a81a5ecd824ace7d40e9a35942a1c853554cd
(cherry picked from commit 30c619d1)
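The mechanism is Ansible's loop_control label; a minimal sketch (module choice and variable names are illustrative):

    # Only the database name appears in the task output; passwords stay hidden.
    - name: Creating service databases
      mysql_db:
        name: "{{ item.database_name }}"
        login_host: "{{ database_address }}"
        login_user: "root"
        login_password: "{{ database_password }}"
      loop: "{{ service_databases }}"        # items also contain passwords
      loop_control:
        label: "{{ item.database_name }}"    # print only the database name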
Jason authored
In a multi-region environment, each region is deployed separately. Cell discovery, however, would sometimes fail because it picked a region different from the one being deployed. Most likely, an internal endpoint for region A will not be visible from region B. Furthermore, it is not very useful to discover hosts in a region you're not modifying. This changes the check to only run against nova compute services located in the region being deployed.

Change-Id: I21eb1164c2f67098b81edbd5cc106472663b92cb
(cherry picked from commit 328e1425)
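A hedged sketch of restricting the check to the current region (auth setup omitted):

    # Count only nova-compute services registered in the region being deployed.
    - name: Listing nova compute services in this region
      command: >
        openstack --os-region-name {{ openstack_region_name }}
        compute service list --service nova-compute -f json
      register: region_compute_services
      changed_when: false
      run_once: true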
Pierre Riteau authored
Kolla-Ansible populates /etc/hosts with overcloud hosts using their API interface IP address. When configured correctly, this allows Nova to use the API interface for live migration of instances between compute hosts. The hostname used is from the `ansible_hostname` variable, which is a short hostname generated by Ansible using the first dot as a delimiter.

However, Nova defaults to using the result of socket.gethostname() to register nova-compute services. In deployments where hostnames are set to FQDNs, for example when using FreeIPA, nova-compute would try to reach the other compute node using its FQDN (as registered in the Nova database), which was absent from /etc/hosts. This can result in failures to live migrate instances if DNS entries don't match.

This commit populates /etc/hosts with `ansible_nodename` (hostname as reported by the system) in addition to `ansible_hostname`, if they are different.

Change-Id: Id058aa1db8d60c979680e6a41f7f3e1c39f98235
Closes-Bug: #1830023
(cherry picked from commit 37899026)
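A hedged sketch of the resulting /etc/hosts entry (api_interface_address stands in for the host's API interface IP; the real change templates the whole file):

    # Add the short hostname and, when different, the system hostname (FQDN).
    - name: Populating /etc/hosts with overcloud hosts
      lineinfile:
        path: /etc/hosts
        line: "{{ api_interface_address }} {{ ansible_hostname }}{{ ' ' ~ ansible_nodename if ansible_nodename != ansible_hostname else '' }}"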
Mark Goddard authored
backport: stein, rocky

During startup of nova-compute, we see the following error message:

    Error gathering result from cell 00000000-0000-0000-0000-000000000000: DBNotAllowed: nova-compute

This issue was observed in devstack [1] and fixed [2] by removing database configuration from the compute service. This change takes the same approach, removing DB config from nova.conf in the nova-compute* containers.

[1] https://bugs.launchpad.net/devstack/+bug/1812398
[2] https://opendev.org/openstack/devstack/commit/82537871376afe98a286e1ba424cf192ae60869a

Change-Id: I18c99ff4213ce456868e64eab63a4257910b9b8e
Closes-Bug: #1829705
(cherry picked from commit 002eec95)
Mark Goddard authored
Right now every controller rotates fernet keys. This is nice because should any controller die, we know the remaining ones will rotate the keys. However, we are currently over-rotating the keys. When we over-rotate keys, we get logs like this:

    This is not a recognized Fernet token <token> TokenNotFound

Most clients can recover and get a new token, but some clients (like Nova passing tokens to other services) can't do that because they don't have the password to regenerate a new token.

With three controllers, in the keystone-fernet crontab we see the once-a-day rotation correctly staggered across the three controllers:

    ssh ctrl1 sudo cat /etc/kolla/keystone-fernet/crontab
    0 0 * * * /usr/bin/fernet-rotate.sh
    ssh ctrl2 sudo cat /etc/kolla/keystone-fernet/crontab
    0 8 * * * /usr/bin/fernet-rotate.sh
    ssh ctrl3 sudo cat /etc/kolla/keystone-fernet/crontab
    0 16 * * * /usr/bin/fernet-rotate.sh

Currently with three controllers we have this keystone config:

    [token]
    expiration = 86400 (although the keystone default is one hour)
    allow_expired_window = 172800 (this is the keystone default)

    [fernet_tokens]
    max_active_keys = 4

Currently, kolla-ansible configures key rotation according to the following:

    rotation_interval = token_expiration / num_hosts

This means we rotate keys more quickly the more hosts we have, which doesn't make much sense. The keystone docs state:

    max_active_keys = ((token_expiration + allow_expired_window) / rotation_interval) + 2

For details see: https://docs.openstack.org/keystone/stein/admin/fernet-token-faq.html

Rotation is based on pushing out a staging key, so should any server start using that key, other servers will consider it valid. Then each server in turn starts using the staging key, each in turn demoting the existing primary key to a secondary key. Eventually you prune the secondary keys when there is no token in the wild that would need to be decrypted using that key. So this all makes sense.

This change adds new variables for fernet_token_allow_expired_window and fernet_key_rotation_interval, so that we can correctly calculate the required number of active keys. We now set the default rotation interval so as to minimise the number of active keys to 3: one primary, one secondary, one buffer.

This change also fixes the fernet cron job generator, which was broken in the following cases:

* requesting an interval of more than 1 day resulted in no jobs
* requesting an interval of more than 60 minutes, unless an exact multiple of 60 minutes, resulted in no jobs

It should now be possible to request any interval up to a week divided by the number of hosts.

Change-Id: I10c82dc5f83653beb60ddb86d558c5602153341a
Closes-Bug: #1809469
(cherry picked from commit 6c1442c3)
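A hedged worked example of the formula above, using the values quoted in the commit message (all values in seconds; variable names follow the ones the change says it introduces, with fernet_token_expiry assumed to be the existing expiry setting):

    # globals.yml (sketch): token lifetime and key rotation tuning.
    fernet_token_expiry: 86400                  # token_expiration: 1 day
    fernet_token_allow_expired_window: 172800   # keystone default: 2 days
    fernet_key_rotation_interval: 259200        # 3 days = expiry + allow_expired_window
    # max_active_keys = ((86400 + 172800) / 259200) + 2 = 1 + 2 = 3
    # i.e. one primary key, one secondary key and one staged (buffer) key.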
Mark Goddard authored
Before making changes to this script, document its behaviour with a unit test. There are two major issues:

* requesting an interval of more than 1 day results in no jobs
* requesting an interval of more than 60 minutes, unless an exact multiple of 60 minutes, results in no jobs

Change-Id: I655da1102dfb4ca12437b7db0b79c9a61568f79e
Related-Bug: #1809469
(cherry picked from commit 25ac955a)
- Jun 13, 2019
Pierre Riteau authored
Check if a base Nova cell already exists before calling `nova-manage cell_v2 create_cell`, which would otherwise create a duplicate cell when the transport URL or database connection change.

If a base cell already exists but the connection values have changed, we now call `nova-manage cell_v2 update_cell` instead. This is only possible if a duplicate cell has not yet been created. If one already exists, we print a warning inviting the operator to perform a manual cleanup. We don't use a hard fail to avoid an abrupt change of behavior if this is backported to stable branches.

Change-Id: I7841ce0cff08e315fd7761d84e1e681b1a00d43e
Closes-Bug: #1734872
(cherry picked from commit 19b8dbe4)
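A hedged sketch of the guard described above (the parsing of list_cells output is elided; base_cell_uuid is an illustrative fact derived from it):

    # Create the base cell only if none is registered yet; otherwise update it.
    - name: Listing existing nova cells
      command: docker exec nova_api nova-manage cell_v2 list_cells --verbose
      register: existing_cells
      changed_when: false
      run_once: true

    - name: Creating the base nova cell
      command: docker exec nova_api nova-manage cell_v2 create_cell
      when: base_cell_uuid is not defined

    - name: Updating the existing base nova cell
      command: docker exec nova_api nova-manage cell_v2 update_cell --cell_uuid {{ base_cell_uuid }}
      when: base_cell_uuid is defined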
- May 31, 2019