OSSC zero-day patching

Author: Pawel Koniszewski

Written by Kamil Szczygiel

There is a wide range of opinions on the current state of live migration capabilities in OpenStack. By establishing objective goals and using repeatable, automated tests, we intend to determine the proper configuration options and uncover any technical issues in the entire stack so we can continually improve the solution's results. The goal is to ensure that live migration with OpenStack is highly reliable, efficient, fast, and automated.

Testing methodology

Our working assumption was that we needed to apply a zero-day system patch to all compute nodes as quickly as possible. To simulate that, we perform the following steps (a scripted sketch of this loop follows the list):

  1. First, we disable the nova-compute services on the nodes to be patched so that virtual machines are not migrated onto compute nodes that are currently being patched (relevant when patching more than one server at a time).
  2. After that, we start live migrating all of the virtual machines off those compute nodes.
  3. Then we force a system reboot of the servers without performing any actual system modifications.
  4. When the servers and their nova-compute services are back online, we re-enable the nova-compute services and repeat the same set of actions on the remaining compute nodes.
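
The loop described above can be scripted end to end. The sketch below is illustrative only: it assumes a Python environment where the nova command-line client is installed and admin credentials are already sourced, the host names are hypothetical, and wait_for_compute_up() stands in for whatever health check an operator would actually use.

#!/usr/bin/env python
"""Illustrative sketch of the zero-day patching loop described above."""
import subprocess
import time

BATCH_SIZE = 3  # "CN" in the results below: compute nodes patched at once
COMPUTE_HOSTS = ["compute-%03d" % i for i in range(1, 101)]  # hypothetical names


def run(cmd):
    """Run a command and fail loudly, so a broken step stops the rollout."""
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)


def wait_for_compute_up(host, timeout=900):
    """Placeholder health check: poll `nova service-list` until the
    nova-compute service on `host` reports state 'up' again."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        out = subprocess.check_output(
            ["nova", "service-list", "--host", host, "--binary", "nova-compute"])
        if b" up " in out:  # crude check against the CLI's table output
            return
        time.sleep(10)
    raise RuntimeError("nova-compute on %s did not come back in time" % host)


for start in range(0, len(COMPUTE_HOSTS), BATCH_SIZE):
    batch = COMPUTE_HOSTS[start:start + BATCH_SIZE]

    # 1. Disable nova-compute so the scheduler stops placing VMs on these hosts.
    for host in batch:
        run(["nova", "service-disable", host, "nova-compute"])

    # 2. Live migrate every instance off the hosts being patched.
    #    (Waiting until each host is actually empty is omitted here for brevity.)
    for host in batch:
        run(["nova", "host-evacuate-live", host])

    # 3. Force a reboot; the test applies no real patch, only the reboot.
    for host in batch:
        run(["ssh", host, "sudo", "reboot"])

    # 4. Wait for nova-compute to report 'up' again, then re-enable it.
    for host in batch:
        wait_for_compute_up(host)
        run(["nova", "service-enable", host, "nova-compute"])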

Configuration of the environment:

  • 100 compute nodes
    • Two (2) Intel® Xeon® E5-2680 v3 processors (12 cores)
    • 256 GB RAM (16 x 16 GB 2133 MHz)
    • Two (2) S3610 800 GB in RAID 1, write back, adaptive read ahead
    • 9000 MTU on all interfaces
    • Four (4) 10G interfaces bonded with layer3+4 transmit hash policy
  • 54 storage nodes
    • Two (2) Intel® Xeon® E5-2670 v3 processors (12 cores)
    • 128 GB RAM (8 x 16 GB 2133 MHz)
    • Two (2) S3610 400 GB as journals, RAID 0 for each disk, write through, no adaptive read ahead
    • Ten (10) 2 TB HDDs as OSDs, RAID 0 for each disk, write back, adaptive read ahead
    • 9000 MTU on all interfaces
    • Four (4) 10G interfaces bonded with layer3+4 transmit hash policy
  • Ceph Jewel* with replication factor 3 and no RBD cache
  • Three (3) controllers (each has OpenStack controller, network node and Ceph monitor role)
  • OpenStack Mitaka
  • Ubuntu* 14.04 on all nodes
  • QEMU 2.0.0
  • Libvirt 1.2.2
  • 40 instances on each compute node (each instance has 2 vCPUs, 7.5 GB RAM, and a 50 GB disk)
    • CentOS* 7
    • qcow2 image type - booted from image to volume
    • Eight (8) Cassandra* instances (Seed placement strategy: SimpleStrategy, Replication factor: 3, Write/Read/Scan/Delete consistency level: Quorum, no vCPU utilization assumptions, 50% average memory utilization)
    • Eight (8) YCSB instances (Stress configuration 2000000 records, 1 thread)
    • Eight (8) Magento* instances (30% average vCPU utilization)
    • Eight (8) idle instances (no RAM/vCPU utilization assumptions)
    • Eight (8) Mediawiki instances (30% average vCPU utilization)

Test results:

| Scenario | T | AC | S | Total patching time | Average VM migration time | Average evacuation time | Average host reboot time | Average host patching time |
|----------|---|----|---|---------------------|---------------------------|--------------------------|--------------------------|----------------------------|
| 1CN1LM | ✓ | - | - | 33h 7m | 24s | 17m 37s | 2m 15s | 19m 52s |
| 1CN1LM | ✓ | ✓ | ✓ | Failed* | N/A | N/A | N/A | N/A |
| 1CN3LM | ✓ | - | - | 27h 15m | 59s | 14m 5s | 2m 14s | 16m 21s |
| 1CN3LM | ✓ | ✓ | ✓ | Failed* | N/A | N/A | N/A | N/A |
| 3CN1LM | ✓ | - | - | 14h 7m | 27s | 6m 43s | 2m 27s | 7m 32s |
| 3CN1LM | ✓ | - | ✓ | Failed* | N/A | N/A | N/A | N/A |
| 3CN1LM | ✓ | ✓ | ✓ | Failed* | N/A | N/A | N/A | N/A |
| 3CN3LM | ✓ | - | - | 10h 45m | 1m 18s | 5m 43s | 2m 20s | 6m 29s |
| 3CN3LM | ✓ | ✓ | ✓ | Failed* | N/A | N/A | N/A | N/A |
| 1CN1LM | - | ✓ | ✓ | 17h 4m | 11s | 8m 22s | 2m 14s | 10m 36s |
| 1CN1LM | - | - | ✓ | 17h 48m | 11s | 7m 55s | 2m 46s | 10m 41s |
| 1CN3LM | - | ✓ | ✓ | 8h 38m | 11s | 2m 52s | 2m 19s | 5m 11s |
| 1CN3LM | - | - | ✓ | 9h 20m | 11s | 3m 5s | 2m 21s | 5m 26s |
| 3CN1LM | - | ✓ | ✓ | 5h 46m | 11s | 7m 55s | 2m 16s | 10m 11s |
| 3CN1LM | - | - | ✓ | 5h 40m | 11s | 7m 45s | 2m 17s | 10m 2s |
| 3CN3LM | - | ✓ | ✓ | 2h 59m** | 11s | 2m 59s | 2m 17s | 5m 16s |
| 3CN3LM | - | - | ✓ | 2h 55m** | 11s | 2m 51s | 2m 18s | 5m 9s |

* Timeout on live migration progress after 15 minutes

** RabbitMQ* connection failures

Legend:

  • CN – number of compute nodes being patched at once
  • LM – number of concurrent live migrations per compute node
  • S – stress being applied to the instances
  • T – libvirt tunnelled transport enabled
  • AC – libvirt auto converge feature enabled

Conclusions:

  • Disabling tunneling allowed stressed virtual machines to migrate.
  • Network throughput on the live migration interface was ~10 Gbit/s with tunneling disabled.
  • Disabling tunneling has a security drawback: migration data is no longer encrypted in transit.
  • One idea to address this is to use a dedicated, encrypted live migration network based on IPv6.
  • Lowest patching time with all virtual machines successfully migrated was five hours and 40 minutes (three compute nodes at once and one concurrent live migration).
  • With tunneling disabled, auto converge was never triggered because the live migration process made good progress on its own.
  • QEMU throttles down the guest's vCPUs only when live migration is not making progress. Because of that, live migration duration with auto converge enabled and disabled was nearly equal.
  • Tests with stressed instances and tunneling enabled caused live migrations to fail.
  • The virtual machines dirtied memory pages faster than the source host could transfer them to the destination host.
  • Enabling the auto converge feature did not improve the live migration process.
  • Network throughput on the live migration interface was < 2 Gbit/s with tunneling enabled (due to the tunneled transport).
  • These live migration attempts timed out after 15 minutes.
  • Patching three hosts at once with three concurrent live migrations (nine concurrent live migrations in total) caused RabbitMQ connection failures. Because of that, not all live migrations were successful.
  • Connection failures during live migration resulted in “ghost” virtual machines: a virtual machine ends up running on a different compute node than the one reported by nova. This causes a mismatch between actual resource usage and what nova reports, and the virtual machine cannot be live migrated back to the correct compute node because its disk already exists on the target node.

nova.conf

inject_partition = -2
inject_password = False
inject_key = False
use_usb_tablet = False
use_virtio_for_bridges = True
cpu_mode = host-model
virt_type = kvm
remove_unused_resized_minimum_age_seconds = 3600
live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE, VIR_MIGRATE_PEER2PEER, VIR_MIGRATE_LIVE, VIR_MIGRATE_TUNNELLED
live_migration_uri = "qemu+ssh://nova@%s/system?no_verify=1&keyfile=/var/lib/nova/.ssh/id_rsa"
osapi_max_limit = 5000
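
The live_migration_flag line above matches the tunneled test runs. For the runs with tunneling disabled, VIR_MIGRATE_TUNNELLED would be dropped from the list, and the auto-converge runs would additionally include VIR_MIGRATE_AUTO_CONVERGE; the two lines below are an illustrative reconstruction of those variants, not a copy of the test configuration.

# tunneling disabled
live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE, VIR_MIGRATE_PEER2PEER, VIR_MIGRATE_LIVE

# tunneling disabled, auto converge enabled
live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE, VIR_MIGRATE_PEER2PEER, VIR_MIGRATE_LIVE, VIR_MIGRATE_AUTO_CONVERGE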

libvirtd.conf

listen_tls = 0
listen_tcp = 1
unix_sock_group = "libvirtd"
unix_sock_ro_perms = "0777"
unix_sock_rw_perms = "0770"
auth_unix_ro = "none"
auth_unix_rw = "none"
auth_tcp = "sasl"

System tuning

  • fs.inotify.max_user_watches = 36864
  • net.ipv4.conf.all.rp_filter = 0
  • net.ipv4.conf.default.rp_filter = 0
  • net.ipv4.ip_forward = 1
  • net.netfilter.nf_conntrack_max = 262144
  • vm.dirty_background_ratio = 5
  • vm.dirty_ratio = 10
  • vm.swappiness = 5
  • net.bridge.bridge-nf-call-ip6tables = 0
  • net.bridge.bridge-nf-call-iptables = 0
  • net.bridge.bridge-nf-call-arptables = 0
  • net.ipv4.neigh.default.gc_thresh1 = 4096
  • net.ipv4.neigh.default.gc_thresh2 = 8192
  • net.ipv4.neigh.default.gc_thresh3 = 16384
  • net.ipv4.route.gc_thresh = 16384
  • net.ipv4.neigh.default.gc_interval = 60
  • net.ipv4.neigh.default.gc_stale_time = 120
  • net.ipv6.neigh.default.gc_thresh1 = 4096
  • net.ipv6.neigh.default.gc_thresh2 = 8192
  • net.ipv6.neigh.default.gc_thresh3 = 16384
  • net.ipv6.route.gc_thresh = 16384
  • net.ipv6.neigh.default.gc_interval = 60
  • net.ipv6.neigh.default.gc_stale_time = 120
  • fs.aio-max-nr = 131072
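
These settings can be made persistent with a sysctl drop-in file; the file name below is an arbitrary example, not taken from the original setup.

# place the settings above in /etc/sysctl.d/99-live-migration-tuning.conf, then reload:
sysctl --system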

Software and workloads used in performance tests might have been optimized for performance only on Intel® architecture.

Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors might cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of each product when combined with other products. For more complete information visit www.intel.com/benchmarks.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.
