How to fix NTP issues on Nutanix CVMs

First, research the following article: https://portal.nutanix.com/#/page/kbs/details?targetId=kA032000000bmjeCAA

If you suspect an offset anyway, run the following; in this example, the time is way off:

allssh grep offset ~/data/logs/genesis.out
================== 173.23.33.12 =================
2018-06-20 08:54:07 INFO time_manager.py:555 NTP offset: -119.154 seconds
2018-06-20 09:04:43 INFO time_manager.py:555 NTP offset: -119.149 seconds
2018-06-20 09:15:14 INFO time_manager.py:555 NTP offset: -119.154 seconds
2018-06-20 09:25:50 INFO time_manager.py:555 NTP offset: -119.145 seconds
2018-06-20 09:36:21 INFO time_manager.py:555 NTP offset: -119.154 seconds

Cassandra will not allow the cluster to jump straight back to the correct time, because such a large offset would mess up timestamps. But fear not: the CVMs come with a script to slowly catch the time up, “Fix Time Drift”:

allssh '(/usr/bin/crontab -l && echo "*/5 * * * * bash -lc /home/nutanix/serviceability/bin/fix_time_drift") | /usr/bin/crontab -'

Then you can keep an eye on the cluster time offset using:

for i in `svmips` ; do echo CVM $i: ; ssh $i "/usr/sbin/ntpq -pn" ; echo ; done
CVM 173.23.33.12:
FIPS mode initialized
Nutanix Controller VM
remote refid st t when poll reach delay offset jitter
==============================================================================
*173.23.33.1 13.65.245.138 3 u 23 256 377 0.513 23.415 21.630
127.127.1.0 .LOCL. 10 l 107m 64 0 0.000 0.000 0.000

CVM 173.23.33.14:
FIPS mode initialized
Nutanix Controller VM
remote refid st t when poll reach delay offset jitter
==============================================================================
*173.23.33.12 173.23.33.1 4 u 135 256 377 0.226 6.950 6.836

CVM 173.23.33.16:
FIPS mode initialized
Nutanix Controller VM
remote refid st t when poll reach delay offset jitter
==============================================================================
*173.23.33.12 173.23.33.1 4 u 30 256 377 0.240 -2.570 6.010

or with the previously mentioned command:

allssh grep offset ~/data/logs/genesis.out
================== 173.23.33.12 =================
2018-06-21 09:41:16 INFO time_manager.py:555 NTP offset: -118.121 seconds
2018-06-21 09:40:22 INFO time_manager.py:555 NTP offset: -0.000 seconds
2018-06-21 09:50:52 INFO time_manager.py:555 NTP offset: 0.005 seconds
2018-06-21 10:01:22 INFO time_manager.py:555 NTP offset: 0.006 seconds

When all caught up, run the NTP health check:

ncc health_checks network_checks check_ntp

Also, after all is clear, don't forget to remove the fix_time_drift crontab job!

allssh "(/usr/bin/crontab -l | sed '/fix_time_drift/d' | /usr/bin/crontab -)"

Azure Cloud, IOPS, DTU and VM equivalence, PaaS vs IaaS MSSQL

I am building the next generation of data warehousing for a client, and just because the cloud's sky is the limit does not mean they can afford to spend freely. Millions of records (25 million) need to be analyzed to produce meaningful reports and dashboards.

Cranking the power all the way up gives us the information we need in seconds. The highest database tier offered is something called P11. But what is behind P11? And did I mention this PaaS option was expensive?

First, it is not readily translatable into regular machines, as the various tiers use a mix of compute, memory and disk IOPS; as per Microsoft, it is “a blended measure of CPU, memory, and data I/O and transaction log I/O…”. This blended unit is the Database Transaction Unit, aka the DTU.

Based on performance trending, it comes to something like this (courtesy of “What the heck is a DTU?”):

Number of cores | IOPS | Memory | DTUs | Service tier | Comparable Azure VM size
1 core, 5% utilization | 10 | ??? | 5 | Basic | Standard_A0, barely used
<1 core | 150 | ??? | 100 | Standard S0-S3 | Standard_A0, not fully utilized
1 core | up to 4,000 | ??? | 500 | Premium – P4 | Standard_DS1_v2
2-3 cores | up to 12,000 | ??? | 1,000 | Premium – P6 | Standard_DS3_v2
4-5 cores | up to 20,000 | ??? | 1,750 | Premium – P11 | Standard_DS4_v2

Notice that this is a comparable size… and that a P11 runs about CAD$15,000/month!!! And while PaaS is great, the price tag is a little too much for mere mortals getting into DWH.

Anyhow, trying to recreate a cheaper P11, I went and picked the E8s_v3 VM size, because it said “Great for relational database servers, medium to large caches, and in-memory analytics”. More sizes here: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes

Size | vCPU | Memory (GiB) | Temp storage (SSD, GiB) | Max data disks | Max cached and temp storage throughput: IOPS / MBps (cache size in GiB) | Max uncached disk throughput: IOPS / MBps | Max NICs / Expected network bandwidth (Mbps)
Standard_E8s_v3 | 8 | 64 | 128 | 16 | 16,000 / 128 (200) | 12,800 / 192 | 4 / 4,000
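
If you would rather script the build than click through the portal, a minimal sketch with the Az PowerShell module could look like this (the resource group, region, VM name, image and credentials below are placeholders, not values from this environment):

# Assumes the Az module is installed and Connect-AzAccount has already been run.
# This simplified New-AzVM call also creates default networking for the VM.
$rg   = "dwh-rg"              # placeholder resource group
$loc  = "canadacentral"       # placeholder region
$cred = Get-Credential        # local admin account for the new VM

New-AzVM -ResourceGroupName $rg `
         -Name "dwh-sql01" `
         -Location $loc `
         -Size "Standard_E8s_v3" `
         -Image "Win2016Datacenter" `
         -Credential $cred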

And so, to host my data, I am adding the best disk I can find on Azure:

This one, because it is closest to the size I need:
A P20 disk is 512 GB; it can achieve 2,300 IOPS and up to 150 MB/s.

When I could have gone for the highest:
A P30 disk is 1,024 GB; it can achieve 5,000 IOPS and up to 200 MB/s, but it only comes in a 1 TB size.
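
To stay in script land, here is a minimal Az PowerShell sketch (placeholder names again) of creating an empty 512 GB Premium SSD, which lands on the P20 tier, and attaching it to the VM:

# Create the empty 512 GB Premium managed disk (P20 tier).
$diskConfig = New-AzDiskConfig -Location $loc -CreateOption Empty -DiskSizeGB 512 -SkuName Premium_LRS
$disk = New-AzDisk -ResourceGroupName $rg -DiskName "dwh-sql01-data0" -Disk $diskConfig

# Attach it to the VM on LUN 0 with read-only host caching.
$vm = Get-AzVM -ResourceGroupName $rg -Name "dwh-sql01"
$vm = Add-AzVMDataDisk -VM $vm -Name "dwh-sql01-data0" -CreateOption Attach -ManagedDiskId $disk.Id -Lun 0 -Caching ReadOnly
Update-AzVM -ResourceGroupName $rg -VM $vm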

The key thing* to understand is that the VM tier will limit the IOPS and throughput of whatever you attach to it.

And so here, while my disk is rated at 2,300 IOPS and 150 MB/s, the machine specification is going to cap me at 16,000 IOPS (no problem there) but only 128 MB/s of cached throughput, which is OK since the P20 only delivers 150 MB/s anyway.

Once the VM is brought online and the disk attached, a quick benchmark matches the given specifications:
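
For reference, a quick check like this can also be scripted with Microsoft's DiskSpd tool (a sketch only: it assumes DiskSpd has been downloaded onto the VM, and F: is a placeholder drive letter for the data disk):

# 60-second sequential read test: 64 KB blocks, 4 threads, 32 outstanding I/Os per thread,
# against a 20 GB test file, with software and hardware caching disabled (-Sh).
.\diskspd.exe -c20G -d60 -w0 -t4 -o32 -b64K -si -Sh F:\disktest.dat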

[Screenshot: IaaS data disk benchmark results]

135 MB/s for 150 MB/s advertised…

But wait, is this fast? My laptop gives me way more:

[Screenshot: laptop disk benchmark results]

How to attach faster disks to my VMs then?  What if I want to create a VHD with 600 GB and 400 MB/s of throughput?

You will not obtain that throughput by just creating a 600 GB VHD, because Azure will place it on a P30 disk, and then you will only have 200 MB/s.

To achieve that, you should use striping, and you can go about it in different ways (a PowerShell sketch for the second option follows the list):

  • Create 2 × 600 GB VHDs. Azure will create them on P30 disks. Then use your striping tool (Storage Spaces) to create a 1,200 GB volume. This volume will permit 400 MB/s and 10,000 IOPS, but in this case you will have 600 GB you do not need.
  • Create 3 VHDs of 200 GB each. Azure will create them on P20 disks. Then use your striping tool (Storage Spaces) to create a 600 GB volume. This volume will permit 450 MB/s (150 MB/s × 3) and 6,900 IOPS (2,300 IOPS × 3).
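
As mentioned above, here is a minimal Storage Spaces sketch for the second option (three data disks striped into one simple volume). Pool and virtual disk names are placeholders, and the 64 KB interleave/allocation unit is just a common choice for SQL Server data volumes, not a requirement from this setup:

# Pool the three raw data disks and carve a simple (striped) virtual disk across them.
$disks = Get-PhysicalDisk -CanPool $true
New-StoragePool -FriendlyName "DataPool" -StorageSubSystemFriendlyName "Windows Storage*" -PhysicalDisks $disks

New-VirtualDisk -StoragePoolFriendlyName "DataPool" -FriendlyName "DataStripe" `
                -ResiliencySettingName Simple -NumberOfColumns 3 -Interleave 65536 -UseMaximumSize

# Initialize, partition and format the new striped volume.
Get-VirtualDisk -FriendlyName "DataStripe" | Get-Disk |
    Initialize-Disk -PartitionStyle GPT -PassThru |
    New-Partition -AssignDriveLetter -UseMaximumSize |
    Format-Volume -FileSystem NTFS -AllocationUnitSize 65536 -Confirm:$false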

Wait, I need more! What if I want to create a VHD with 600 GB and 600 MB/s of throughput? Unfortunately, we can't just dream it up and ask Azure for it, at least not today. In fact, the maximum throughput possible is 512 MB/s; we can't do better.

OK. You have created the striped volumes and you are still not getting what you want? Remember the key thing I mentioned above*? The total data storage, the IOPS and the throughput are limited by the VM series and size. Each Azure virtual machine type is limited to a number of disks (total storage size), a maximum IOPS and a maximum throughput. For example, you may achieve 128 MB/s only on a Standard_E8s_v3; all the other VM types will throttle your IOPS or throughput once you reach their threshold.

While I was looking at the memory-optimized server sizes, I noticed there are also storage-optimized VM sizes:

Size | vCPU | Memory (GiB) | Temp storage (SSD, GiB) | Max data disks | Max cached and temp storage throughput: IOPS / MBps (cache size in GiB) | Max uncached disk throughput: IOPS / MBps | Max NICs / Expected network bandwidth (Mbps)
Standard_E8s_v3 | 8 | 64 | 128 | 16 | 16,000 / 128 (200) | 12,800 / 192 | 4 / 4,000
Standard_L8s | 8 | 64 | 32 | 32 | 40,000 / 400 | 10,000 / 250 | 4 / 8,000

And so, beyond the disk creation steps above, you will also want to pick a VM size that is able to deliver the throughput you need.

#azure, #database, #dtu, #iaas, #mssql, #paas, #sql

IBM v7000 WebUI not responding

I ran into a case of a non-responding WebUI for managing LUNs (and other things) on an IBM Storwize V7000.
Note that you can only run satask commands if you connect to the CLI using the SSH private key associated with the user called superuser; no other SSH key will allow you to run satask commands:

IBM_Storwize:SAN1:superuser>sainfo lsservicenodes
panel_name cluster_id cluster_name node_id node_name relation node_status error_data
01-1 00000100204029A2 SAN1 1 node1 local Active
01-2 00000100204029A2 SAN1 2 node2 partner Active

To find out which node is the configuration node, run the following command:

sainfo lsservicestatus <Active panel name>

The following line in the output tells you which node is the config node:

IBM_Storwize:SAN1:superuser>sainfo lsservicestatus 01-1
panel_name 01-1
cluster_id 00000100204029a2
cluster_name SAN1
cluster_status Active
cluster_ip_count 2
cluster_port 1
cluster_ip 172.xx.xx.33
cluster_gw 172.xx.xx.252
cluster_mask 255.255.254.0

Use the following command to restart the web service on the configuration node:

satask restartservice -service tomcat 01-1

Wait at least 5 minutes for the service to restart before assuming that it has failed.

If this does not solve your problem, you should restart the canister with the configuration node on it. Remember to connect to the node that is not the configuration node and restart the “partner”. If you are unable to get it to reset that way, reset the canister physically: pull it out a few inches from the chassis and then insert it again after 30 seconds.

#ibm, #storage-2, #v7000, #webui

After a migration to Exchange Online, the Exchange Autodiscover SCP got in the way

There was once a customer who was on Exchange 2013 on-prem for their email needs. At some point they decided (thanks to Microsoft's enticing pricing) to go with Office 365 and migrate all of their mailboxes and email needs to the cloud.

After the Exchange servers were taken out of the equation, users and new Outlook setups started seeing some uninteresting error/warning messages:

[Screenshot: Outlook certificate mismatch warning]

And so, while moving to Office 365, all of the DNS entries (internal and external) for autodiscover had been changed to autodiscover.outlook.com, but the old one – autodiscover.contoso.com, in theory the on-prem Exchange CAS – would still be around and generate the message. So where would this on-prem reference come from?

When it was on-prem, Exchange had been configured to add entries to Active Directory, here in the form of a Service Connection Point (SCP) object, and as domain-joined machines try to auto-configure Outlook, they end up finding that older reference.

[Diagram: Autodiscover functional process]

The solution is to remove this from AD! There are multiple ways to prevent Outlook from contacting the local Exchange server first…

Using Exchange Management Shell (EMS)
The preferred way is to use the Exchange Management Shell to clear the entry for the Client Access server from the SCP.

[PS] C:\Windows\system32>Get-ClientAccessServer | fl *uri*
AutoDiscoverServiceInternalUri : https://webmail.contoso.com/autodiscover/autodiscover.xml
AutoDiscoverServiceInternalUri : https://webmail.contoso.com/autodiscover/autodiscover.xml
[PS] C:\Windows\system32>Set-ClientAccessServer -Identity cas-ex1 -AutoDiscoverServiceInternalUri $null
[PS] C:\Windows\system32>Set-ClientAccessServer -Identity cas-ex2 -AutoDiscoverServiceInternalUri $null

ADSIEdit
If the above method can no longer be used, a low-level AD editor such as ADSIEdit can be used to remove the SCP manually. The full path of the SCP is:

CN=ServerName,CN=Autodiscover,CN=Protocols,CN=ServerName,CN=Servers,CN=Exchange Administrative Group (FGH124FG788IYF),CN=Administrative Groups,CN=OrganizationName,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=DomainName,DC=Suffix


The object to remove has the class type serviceConnectionPoint.
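
If you would rather locate the SCP objects from PowerShell before touching ADSIEdit, a small sketch using the ActiveDirectory module (an assumption here; it is not part of the Exchange tools) could look like this:

Import-Module ActiveDirectory

# SCPs live in the Configuration partition; serviceBindingInformation holds the
# autodiscover.xml URL that domain-joined Outlook clients pick up.
$configNC = (Get-ADRootDSE).configurationNamingContext
Get-ADObject -LDAPFilter "(objectClass=serviceConnectionPoint)" `
             -SearchBase $configNC -SearchScope Subtree `
             -Properties serviceBindingInformation |
    Where-Object { $_.serviceBindingInformation -like "*autodiscover*" } |
    Format-List Name, DistinguishedName, serviceBindingInformation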

Some useful tools to help find any autodiscover issues:
SaRA (Microsoft Support and Recovery Assistant): https://diagnostics.office.com
Some reference from TechNet (yes, I still call it TechNet): https://technet.microsoft.com/en-us/library/bb124251.aspx
More about SCPs: https://msdn.microsoft.com/en-us/library/office/dn467397(v=exchg.150).aspx

#exchange, #office-365, #outlook

Netflix and OpenDNS are not friends – Netflix support sucks.

For some reason I had not opted to let the Netflix app on my Android TV update itself, and one day, with an Android update, it decided it was going to use version 4.0.4 build 1716 instead.

Then Netflix would not load any video, getting stuck at around 25% with the lovely error tvq-pm-100 (3.1-52).

Life after work ended. The evening entertainment was ruined forever and I was not the same again. The downward spiral was inevitable.

I chatted with Netflix and spent hours on the phone going through the meaningless scripted troubleshooting – had I restarted my TV box? Logged off and back on? Cleared the cache? Reset the appliance? Nothing worked; I was on the verge of video deprivation.

The most intriguing aspect was that the competent Netflix staff would say: well, since it is not us, it must be your network provider. Yet they were not able to say what getting stuck at 25% could mean. Where are the good old logs telling you what is going on when you need them?

I then read on a forum that the Netflix Android TV app would rely on Google DNS to geo-triangulate you and spy on you.

In order to protect my household I had opted long ago for OpenDNS to block DoubleClick and the other webspam of the universe, without any issues on previous versions of Netflix.

In the end, changing the DNS settings on that Android TV to use Google's infamous DNS servers 8.8.8.8 and 8.8.4.4 got Netflix videos loading at lightning speed again – and that very same Android TV box could once more spy on me at will.

Thanks to Google’s sneakiness the end of the world was avoided.

#google, #netflix, #netflix-sucks, #opendns, #privacy, #support-sucks

Copy windows features from a server to another

Sometimes you want to create the (almost) same server in an environment where you do not yet have Chef, CF or some sort of DSC. The best resort is to use what you have: Get-WindowsFeature.

Imagine you want to configure server B from server A, and obviously these are Microsoft Windows servers…

#On server A
#export features
> Get-WindowsFeature | ? { $_.Installed -AND $_.SubFeatures.Count -eq 0 } | Export-Clixml .\serverA.xml
#copy the feature file over
> cp .\serverA.xml '\\serverB\c$\Files'

#On server B
PS C:\Files> ls
Directory: C:\Files
Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a----        6/8/2017   4:15 PM      510824 ServerA.xml
PS C:\Files> Import-Module Servermanager
PS C:\Files> Import-Clixml .\ServerA.xml | Add-WindowsFeature
Success Restart Needed Exit Code      Feature Result
------- -------------- ---------      --------------
True    Yes            SuccessRest... {Application Server, .NET Framework 4.5, W...
WARNING: You must restart this server to finish the installation process.

Voila, another posh timesaver.

#clone, #get-windowsfeature, #powershell, #server, #windows-2

Nutanix AOS 5.1 & Companions are now GA

For the second time this year, Nutanix has released a major feature upgrade to AOS and companion software. Now available, is AOS 5.1! Top of the list of new features is vSphere 6.5 support for NX platforms (Nutanix branded gear). vSphere 6.5 support for OEM platforms is coming soon. But that’s not the only new feature. Here’s a rundown of some (not all) of the new features:

  • 1-click controller VM (CVM) memory upgrade
  • XenServer support on NX-1065-G5, NX-3060-G5, NX-3175-G5 (optionally with NVIDIA M60)
  • All-flash clusters now support adding hybrid nodes (e.g. cold storage only nodes). Minimum 2 AF nodes.
  • Automatic “admin” account password sync across all CVMs, Prism Web console, and SSH interfaces.
  • Docker container management through self-service portal.
  • Prism 1-click feature to install Docker host VM
  • Post-process compression is enabled by default on all new containers with Pro and Ultimate licenses
  • 1-click centralized upgrades from Prism Central
  • 1-click Prism central cluster registration and Prism Central Deployment
  • Pulse (telemetry) enabled for Prism Central
  • Auto-resolved alerts
  • User defined alerts
  • Graphics and compute mode for NVIDIA M60 GPU
  • CHAP authentication for Acropolis Block Services
  • Hot-plug CPU and memory on AHV VMs
  • Metro availability and synchronous replication supported across hardware vendors (NX, Dell, Lenovo). Async support continues.
  • VirtIO drivers updated to v1.1
  • Dynamically increase EC-X strip size as cluster is expanded
  • Much improved storage efficiency reporting in Prism (compression, dedupe, EC-X, etc.)
  • Disk rebuild time estimation
  • AFS supports Mac OS v10.10, v10.11, v10.12
  • Acropolis Block Service enhanced OS support (Solaris 11, RHEL 6, 7, 6.8)

Tech Preview Features include:

  • Software only support for UCS B-series blades
  • GPU pass-through for AHV guest VMs
  • Support 3rd-party network function VMs (e.g. load balancer, firewall, etc.) routed through Open vSwitch (OVS).

Companion Software Updates

  • Prism Central 5.1
  • Acropolis File Services (AFS) 2.1
  • Acropolis Container Services (ACS) 1.0
  • Foundation 3.7.2

Helpful Links

As of 5/1/2017, AOS 5.1 has not been enabled for automatic download and 1-click upgrades. If you don't want to wait for the automatic download switch to be flipped (in the near future), you can grab the AOS binary from the support portal and use the 1-click upgrade process. As always, thoroughly read the full release notes on the support portal before attempting an upgrade.

#acropolis, #aos, #nutanix, #virtual