- Aug 27, 2019
Doug Szumski authored
Even though Kolla services are configured to log to files rather than stdout, some output still reaches stdout, for example when a container (re)starts. Since the Docker logs are not constrained in size, they can fill up the drive holding the Docker volumes and bring down the host. One particularly problematic case is when Fluentd cannot parse a log message: the warning output is written to the Docker log, and in production we have seen it consume 100 GB of disk space in less than a day. We could configure Fluentd not to do this, but the problem may still occur via another mechanism.

Change-Id: Ia6d3935263a5909c71750b34eb69e72e6e558b7a
Closes-Bug: #1794249
(cherry picked from commit bd54b991)
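As a hedged illustration of bounding Docker log growth (not necessarily the exact change), the json-file log driver can be capped via kolla-ansible's docker_custom_config variable, which is rendered into /etc/docker/daemon.json; the size values below are illustrative:

    # globals.yml (sketch): cap the size of per-container Docker logs.
    docker_custom_config:
      log-driver: "json-file"
      log-opts:
        max-size: "50m"   # illustrative value
        max-file: "3"     # illustrative value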
- Aug 15, 2019
Zuul authored
- Aug 07, 2019
Zuul authored
Michal Nasiadka authored
* Sometimes getting/creating the Ceph MDS keyring fails, similar to https://tracker.ceph.com/issues/16255

Change-Id: I47587cbeb8be0e782c13ba7f40367409e2daa8a8
(cherry picked from commit 4e3054b5)
(cherry picked from commit a579e19b)
(cherry picked from commit de34995b)
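A hedged sketch of the retry pattern such a fix typically uses (the task, container and keyring names below are illustrative, not the exact change):

    # Retry the keyring operation a few times to ride out transient monitor errors.
    - name: Getting or creating the Ceph MDS keyring (sketch)
      command: >
        docker exec ceph_mon ceph auth get-or-create mds.{{ inventory_hostname }}
        mon 'allow profile mds' mds 'allow *' osd 'allow rwx'
      register: mds_keyring_result
      retries: 3
      delay: 5
      until: mds_keyring_result is succeeded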
- Aug 05, 2019
Zuul authored
- Aug 03, 2019
- Aug 02, 2019
Zuul authored
- Jul 19, 2019
David Rabel authored
When editing the external bridge configuration and running a reconfigure on openvswitch, the handler "Ensuring OVS bridge is properly setup" needs to run, but doesn't. This moves the task from the handlers into its own file and always includes it after running the handlers.

Change-Id: Iee39cf00b743ab0776354749c6e162814b5584d8
Closes-Bug: #1794504
(cherry picked from commit 8736817a)
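A minimal sketch of the resulting pattern, assuming the bridge setup task lives in its own file (the file name is illustrative):

    # Run any notified handlers first, then always include the bridge setup tasks
    # instead of relying on a handler that may not be triggered.
    - name: Flush handlers
      meta: flush_handlers

    - name: Ensuring OVS bridge is properly setup
      include_tasks: post-config.yml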
- Jul 17, 2019
Zuul authored
- Jul 16, 2019
Zuul authored
- Jul 12, 2019
Raimund Hook authored
Currently, the documentation around configuring regions directs you to make changes to openstack_region_name and multiple_regions_names in the globals.yml file. The defaults weren't represented there, which could potentially cause confusion. This change adds these defaults with a brief description.

TrivialFix
Change-Id: Ie0ff7e3dfb9a9355a9c9dbaf27151d90162806dd
(cherry picked from commit e72c49ed)
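For reference, the documented defaults look like the following globals.yml entries (a sketch; check your release's globals.yml for the exact comments):

    # Region in which this deployment registers its endpoints.
    openstack_region_name: "RegionOne"
    # Names of all regions sharing this Keystone; used for multi-region deployments.
    multiple_regions_names:
      - "{{ openstack_region_name }}"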
Mark Goddard authored
A common class of problems goes like this:

* kolla-ansible deploy
* Hit a problem, often in ansible/roles/*/tasks/bootstrap.yml
* Re-run kolla-ansible deploy
* Service fails to start

This happens because the DB is created during the first run, but for some reason we fail before performing the DB sync. This means that on the second run we don't include ansible/roles/*/tasks/bootstrap_service.yml because the DB already exists, and therefore still don't perform the DB sync. However, this time the command may complete without apparent error. We should be less careful about when we perform the DB sync, and do it whenever it is necessary. There is an argument for not doing the sync during a 'reconfigure' command, although we will not change that here. This change always performs the DB sync during the 'deploy' and 'reconfigure' commands.

Change-Id: I82d30f3fcf325a3fdff3c59f19a1f88055b566cc
Closes-Bug: #1823766
Closes-Bug: #1797814
(cherry picked from commit d5e5e885...
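A hedged sketch of the resulting deploy flow (the condition name is illustrative; the task file names follow the role layout mentioned above):

    # deploy.yml (sketch): create the database only when it is missing, but
    # always run the service bootstrap, whose DB sync is idempotent.
    - include_tasks: bootstrap.yml
      when: database_missing | bool          # illustrative condition

    - include_tasks: bootstrap_service.yml   # now included unconditionally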
Raimund Hook authored
Tweaked some of the language in doc/source/user/multi-regions.rst for clarity.

TrivialFix
Change-Id: Icdd8da6886d0e39da5da80c37d14d2688431ba8f
- Jul 11, 2019
Mark Goddard authored
When running deploy or reconfigure for Keystone, ansible/roles/keystone/tasks/deploy.yml calls init_fernet.yml, which runs /usr/bin/fernet-rotate.sh, which calls keystone-manage fernet_rotate. This means that a token can become invalid if the operator runs deploy or reconfigure too often. This change splits fernet-push.sh out of the fernet-rotate.sh script, then calls fernet-push.sh after the fernet bootstrap performed during deploy.

Change-Id: I824857ddfb1dd026f93994a4ac8db8f80e64072e
Closes-Bug: #1833729
(cherry picked from commit 09e29d0d)
- Jul 09, 2019
Mark Goddard authored
There is a race condition during nova deploy since we wait for at least one compute service to register itself before performing cells v2 host discovery. It is quite possible that other compute nodes will not yet have registered and will therefore not be discovered. This leaves them not mapped into a cell, and results in the following error if the scheduler picks one when booting an instance:

    Host 'xyz' is not mapped to any cell

The problem has been exacerbated by merging a fix [1][2] for a nova race condition, which disabled the dynamic periodic discovery mechanism in the nova scheduler. This change fixes the issue by waiting for all expected compute services to register themselves before performing host discovery. This includes both virtualised compute services and bare metal compute services.

This patch also includes change I58f8fd0a6e82cb614e02fef6e5b271af1d1ce9af, which was made to fix an issue with the original version of this patch running on Ansible<2.8. See bug 1835817 for details.

Change-Id: I2915e2610e5c0b8d67412e7ec77f7575b8fe9921
Closes-Bug: #1835002
(cherry picked from commit c38dd767)
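A hedged sketch of the waiting step described above (the auth setup is omitted and the inventory group name is assumed to be 'compute'):

    # Poll until every expected nova-compute service has registered itself.
    - name: Waiting for all nova-compute services to register
      command: openstack compute service list --service nova-compute -f json
      register: compute_services
      changed_when: false
      run_once: true
      retries: 20
      delay: 10
      until: (compute_services.stdout | from_json | length) >= (groups['compute'] | length)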
- Jul 08, 2019
Mark Goddard authored
* Fix wsrep sequence number detection. The log message format is 'WSREP: Recovered position: <UUID>:<seqno>' but we were picking out the UUID rather than the sequence number. This is as good as random.
* Add become: true to log file reading and removal, since I4a5ebcedaccb9261dbc958ec67e8077d7980e496 added become: true to the 'docker cp' command which creates it.
* Don't run handlers during recovery. If the config files change we would end up restarting the cluster twice.
* Wait for wsrep recovery container completion (don't detach). This avoids a potential race between wsrep recovery and the subsequent 'stop_container'.
* Finally, we now wait for the bootstrap host to report that it is in an OPERATIONAL state. Without this we can see errors where the MariaDB cluster is not ready when used by other services.

Change-Id: Iaf7862be1affab390f811fc485fd0eb6879fd583
Closes-Bug: #1834467
(cherry picked from commit 86f373a1)
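A hedged sketch of picking out the sequence number rather than the UUID (the registered variable name is illustrative):

    # Take the trailing <seqno>, not the UUID, from
    # 'WSREP: Recovered position: <UUID>:<seqno>'.
    - name: Extracting the wsrep sequence number
      set_fact:
        wsrep_seqno: "{{ (wsrep_recovery_log.stdout | regex_search('WSREP: Recovered position:.*')).split(':') | last | trim }}"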
- Jul 04, 2019
Mark Goddard authored
During an upgrade, nova pins the version of RPC calls to the minimum seen across all services. This ensures that old services do not receive data they cannot handle. After the upgrade is complete, all nova services are supposed to be reloaded via SIGHUP to cause them to check the RPC versions of services again and use the new latest version, which should now be supported by all running services.

Due to a bug [1] in oslo.service, sending services SIGHUP is currently broken. We replaced the HUP with a restart for the nova_compute container for bug 1821362, but not for the other nova services. It seems we need to restart all nova services to allow the RPC version pin to be removed. Testing in a Queens to Rocky upgrade, we find the following in the logs:

    Automatically selected compute RPC version 5.0 from minimum service version 30

However, the service version in Rocky is 35.

There is a second issue in that it takes some time for the upgraded services to update the nova services database table with their new version. We need to wait until all nova-compute services have done this before the restart is performed, otherwise the RPC version cap will remain in place. There is currently no interface in nova available for checking these versions [2], so as a workaround we use a configurable delay with a default duration of 30 seconds. Testing showed it takes about 10 seconds for the version to be updated, so this gives us some headroom.

This change restarts all nova services after an upgrade, after a 30 second delay.

[1] https://bugs.launchpad.net/oslo.service/+bug/1715374
[2] https://bugs.launchpad.net/nova/+bug/1833542

Change-Id: Ia6fc9011ee6f5461f40a1307b72709d769814a79
Closes-Bug: #1833069
Related-Bug: #1833542
(cherry picked from commit e6d2b922)
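A hedged sketch of the delay-then-restart step described above (the variable and task file names are illustrative; check the nova role for the real ones):

    # Give upgraded nova-compute services time to record their new service
    # version before restarting everything to lift the RPC version pin.
    - name: Waiting for nova services to update their service version
      pause:
        seconds: "{{ nova_service_version_delay | default(30) }}"   # illustrative variable

    - name: Restarting all nova services to remove the RPC version pin
      include_tasks: restart_services.yml   # illustrative task file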
- Jun 30, 2019
Radosław Piliszek authored
The way Zuul handles job dependencies caused the computed path to setup_ceph_disks.sh to be wrong when the job was run due to a kolla (as opposed to kolla-ansible) change. This patch uses a relative path, as is the case in Stein and later.

Change-Id: I4f45cd18b86ebe0efcc3101e31e4bcafdc83eb57
Signed-off-by: Radosław Piliszek <radoslaw.piliszek@gmail.com>
(cherry picked from commit 1d22714c)
- Jun 28, 2019
Radosław Piliszek authored
In a rare event both kolla-ansible and nova-scheduler try to do the mapping at the same time and one of them fails. Since kolla-ansible runs host discovery on each deployment, there is no need to change the default of no periodic host discovery.

I added some notes for the future; they are not critical. I made the decision explicit in the comments. I changed the task name to satisfy recommendations. I removed the variable because it is not used (to avoid future doubts).

Closes-Bug: #1832987
Change-Id: I3128472f028a2dbd7ace02abc179a9629ad74ceb
Signed-off-by: Radosław Piliszek <radoslaw.piliszek@gmail.com>
(cherry picked from commit ce680bcf)
(cherry picked from commit ed3d1e37)
- Jun 27, 2019
Radosław Piliszek authored
Due to zuul-cloner being finally removed from Zuul v3, we have to remove zuul-cloner usage from supported branches to keep using CI. Stein and up have it removed by an unrelated change. This simple patch is to go to rocky and below. See [1] for the Zuul change.

[1] https://review.opendev.org/663151

Change-Id: Id4ac65148daf09cb192789a1d73871811cdea342
Signed-off-by: Radosław Piliszek <radoslaw.piliszek@gmail.com>
(cherry picked from commit 924226df)
- Jun 24, 2019
- Jun 19, 2019
- Jun 18, 2019
Cody Hammock authored
If Blazar is enabled, ensure that fluentd processes its logs.

Change-Id: If71d5c056c042667388dae8e4ee6d51a5ecab46e
(cherry picked from commit 2c343562)
Jason authored
The Nova aggregate was always defaulting to some region (usually the first in the Keystone endpoint list) when registering the Nova aggregate for Blazar. Add a region override to ensure we always write to the region being deployed.

Change-Id: I3f921ac51acab1b1020a459c07c755af7023e026
(cherry picked from commit f20cbf49)
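A hedged sketch of such a region override (parameters simplified; the aggregate name is illustrative):

    # Pin the aggregate operation to the region currently being deployed.
    - name: Creating the Blazar freepool host aggregate
      os_nova_host_aggregate:
        name: freepool
        auth: "{{ openstack_auth }}"
        region_name: "{{ openstack_region_name }}"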
Jason authored
When Ansible goes into a loop, by default it prints all the keys of the item it is looping over. Some roles, when setting up the databases, iterate over an object that includes the database password. Override the loop label to hide everything but the database name.

Change-Id: I336a81a5ecd824ace7d40e9a35942a1c853554cd
(cherry picked from commit 30c619d1)
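The mechanism is Ansible's loop_control label; a minimal sketch (module choice and variable names are illustrative):

    # Only the database name appears in the task output; passwords stay hidden.
    - name: Creating service databases
      mysql_db:
        name: "{{ item.database_name }}"
        login_host: "{{ database_address }}"
        login_user: "root"
        login_password: "{{ database_password }}"
      loop: "{{ service_databases }}"        # items also contain passwords
      loop_control:
        label: "{{ item.database_name }}"    # print only the database name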
Jason authored
In a multi-region environment, each region is deployed separately. Cell discovery, however, would sometimes fail because it picked a region different from the one being deployed. Most likely, an internal endpoint for region A will not be visible from region B. Furthermore, it is not very useful to discover hosts in a region you're not modifying. This changes the check to only run against nova compute services located in the region being deployed.

Change-Id: I21eb1164c2f67098b81edbd5cc106472663b92cb
(cherry picked from commit 328e1425)
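A hedged sketch of restricting the check to the current region (auth setup omitted):

    # Count only nova-compute services registered in the region being deployed.
    - name: Listing nova compute services in this region
      command: >
        openstack --os-region-name {{ openstack_region_name }}
        compute service list --service nova-compute -f json
      register: region_compute_services
      changed_when: false
      run_once: true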
Pierre Riteau authored
Kolla-Ansible populates /etc/hosts with overcloud hosts using their API interface IP address. When configured correctly, this allows Nova to use the API interface for live migration of instances between compute hosts. The hostname used is from the `ansible_hostname` variable, which is a short hostname generated by Ansible using the first dot as a delimiter.

However, Nova defaults to using the result of socket.gethostname() to register nova-compute services. In deployments where hostnames are set to FQDNs, for example when using FreeIPA, nova-compute would try to reach the other compute node using its FQDN (as registered in the Nova database), which was absent from /etc/hosts. This can result in failures to live migrate instances if DNS entries don't match.

This commit populates /etc/hosts with `ansible_nodename` (hostname as reported by the system) in addition to `ansible_hostname`, if they are different.

Change-Id: Id058aa1db8d60c979680e6a41f7f3e1c39f98235
Closes-Bug: #1830023
(cherry picked from commit 37899026)
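A hedged sketch of the resulting /etc/hosts entry (api_interface_address stands in for the host's API interface IP; the real change templates the whole file):

    # Add the short hostname and, when different, the system hostname (FQDN).
    - name: Populating /etc/hosts with overcloud hosts
      lineinfile:
        path: /etc/hosts
        line: "{{ api_interface_address }} {{ ansible_hostname }}{{ ' ' ~ ansible_nodename if ansible_nodename != ansible_hostname else '' }}"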
Mark Goddard authored
backport: stein, rocky

During startup of nova-compute, we see the following error message:

    Error gathering result from cell 00000000-0000-0000-0000-000000000000: DBNotAllowed: nova-compute

This issue was observed in devstack [1] and fixed [2] by removing database configuration from the compute service. This change takes the same approach, removing DB config from nova.conf in the nova-compute* containers.

[1] https://bugs.launchpad.net/devstack/+bug/1812398
[2] https://opendev.org/openstack/devstack/commit/82537871376afe98a286e1ba424cf192ae60869a

Change-Id: I18c99ff4213ce456868e64eab63a4257910b9b8e
Closes-Bug: #1829705
(cherry picked from commit 002eec95)
Mark Goddard authored
Right now every controller rotates fernet keys. This is nice because should any controller die, we know the remaining ones will rotate the keys. However, we are currently over-rotating the keys. When we over-rotate keys, we get logs like this:

    This is not a recognized Fernet token <token> TokenNotFound

Most clients can recover and get a new token, but some clients (like Nova passing tokens to other services) can't do that because they don't have the password to regenerate a new token.

With three controllers, in the keystone-fernet crontab we see the once-a-day rotation correctly staggered across the three controllers:

    ssh ctrl1 sudo cat /etc/kolla/keystone-fernet/crontab
    0 0 * * * /usr/bin/fernet-rotate.sh
    ssh ctrl2 sudo cat /etc/kolla/keystone-fernet/crontab
    0 8 * * * /usr/bin/fernet-rotate.sh
    ssh ctrl3 sudo cat /etc/kolla/keystone-fernet/crontab
    0 16 * * * /usr/bin/fernet-rotate.sh

Currently with three controllers we have this keystone config:

    [token]
    expiration = 86400 (although the keystone default is one hour)
    allow_expired_window = 172800 (this is the keystone default)

    [fernet_tokens]
    max_active_keys = 4

Currently, kolla-ansible configures key rotation according to the following:

    rotation_interval = token_expiration / num_hosts

This means we rotate keys more quickly the more hosts we have, which doesn't make much sense. The keystone docs state:

    max_active_keys = ((token_expiration + allow_expired_window) / rotation_interval) + 2

For details see: https://docs.openstack.org/keystone/stein/admin/fernet-token-faq.html

Rotation is based on pushing out a staging key, so should any server start using that key, other servers will consider it valid. Then each server in turn starts using the staging key, each in turn demoting the existing primary key to a secondary key. Eventually you prune the secondary keys when there is no token in the wild that would need to be decrypted using that key. So this all makes sense.

This change adds new variables for fernet_token_allow_expired_window and fernet_key_rotation_interval, so that we can correctly calculate the required number of active keys. We now set the default rotation interval so as to minimise the number of active keys to 3: one primary, one secondary, one buffer.

This change also fixes the fernet cron job generator, which was broken in the following cases:

* requesting an interval of more than 1 day resulted in no jobs
* requesting an interval of more than 60 minutes, unless an exact multiple of 60 minutes, resulted in no jobs

It should now be possible to request any interval up to a week divided by the number of hosts.

Change-Id: I10c82dc5f83653beb60ddb86d558c5602153341a
Closes-Bug: #1809469
(cherry picked from commit 6c1442c3)
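A hedged worked example of the formula above, using the values quoted in the commit message (all values in seconds; variable names follow the ones the change says it introduces, with fernet_token_expiry assumed to be the existing expiry setting):

    # globals.yml (sketch): token lifetime and key rotation tuning.
    fernet_token_expiry: 86400                  # token_expiration: 1 day
    fernet_token_allow_expired_window: 172800   # keystone default: 2 days
    fernet_key_rotation_interval: 259200        # 3 days = expiry + allow_expired_window
    # max_active_keys = ((86400 + 172800) / 259200) + 2 = 1 + 2 = 3
    # i.e. one primary key, one secondary key and one staged (buffer) key.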
Mark Goddard authored
Before making changes to this script, document its behaviour with a unit test. There are two major issues:

* requesting an interval of more than 1 day results in no jobs
* requesting an interval of more than 60 minutes, unless an exact multiple of 60 minutes, results in no jobs

Change-Id: I655da1102dfb4ca12437b7db0b79c9a61568f79e
Related-Bug: #1809469
(cherry picked from commit 25ac955a)
- Jun 13, 2019
Pierre Riteau authored
Check if a base Nova cell already exists before calling `nova-manage cell_v2 create_cell`, which would otherwise create a duplicate cell when the transport URL or database connection change.

If a base cell already exists but the connection values have changed, we now call `nova-manage cell_v2 update_cell` instead. This is only possible if a duplicate cell has not yet been created. If one already exists, we print a warning inviting the operator to perform a manual cleanup. We don't use a hard fail to avoid an abrupt change of behavior if this is backported to stable branches.

Change-Id: I7841ce0cff08e315fd7761d84e1e681b1a00d43e
Closes-Bug: #1734872
(cherry picked from commit 19b8dbe4)
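A hedged sketch of the guard described above (the parsing of list_cells output is elided; base_cell_uuid is an illustrative fact derived from it):

    # Create the base cell only if none is registered yet; otherwise update it.
    - name: Listing existing nova cells
      command: docker exec nova_api nova-manage cell_v2 list_cells --verbose
      register: existing_cells
      changed_when: false
      run_once: true

    - name: Creating the base nova cell
      command: docker exec nova_api nova-manage cell_v2 create_cell
      when: base_cell_uuid is not defined

    - name: Updating the existing base nova cell
      command: docker exec nova_api nova-manage cell_v2 update_cell --cell_uuid {{ base_cell_uuid }}
      when: base_cell_uuid is defined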
- May 31, 2019