Azure SQL Managed Instance setup and performance thoughts

I am not going to describe what SQL MI is and how it compares to the other SQL offerings on Azure here, but if you need to know more about MI, this is a great starting place: https://docs.microsoft.com/en-us/azure/sql-database/sql-database-managed-instance. Suffice to say, it is a very viable and cool option for hosting databases in Azure.

Also, if you are looking to test this out and don’t want to integrate the MI into an existing VNet, have a look at this quick-start template: https://github.com/Azure/azure-quickstart-templates/tree/master/101-sql-managed-instance-azure-environment

 

Getting a SQL MI ready

This said, getting an MI installed is not like any other resource or PaaS offering, and you will need to:

  1. Configure the Virtual Network where the Managed Instance will be placed.
  2. Create a route table that enables the Managed Instance to communicate with the Azure Management Service.
  3. Optionally, create a dedicated subnet for the Managed Instance (or use the default one created with the Virtual Network).
  4. Assign the route table to the subnet.
  5. Double-check that you have not added something that might cause a problem.

 

The VNet is the network container for all your stuff; this said, the MI must not be on a subnet that has anything else on it, so creating a subnet dedicated to all of your MIs makes sense. There cannot be any service endpoints attached to the subnet either. If you want to have only one subnet in your Virtual Network (the Virtual Network blade will let you define a first subnet called default), you need to know that a Managed Instance subnet can have between 16 and 256 addresses. Therefore, use subnet masks /28 to /24 when defining the IP range for the default subnet. If you know how many instances you will have, make sure you have at least 2 addresses per instance plus 5 system addresses (e.g., 4 instances need 4 × 2 + 5 = 13 addresses, so a /27 with 32 addresses is comfortable).

 

The route table will allow the MI to talk to the Azure Management Service. This is required because the Managed Instance is placed in your private Virtual Network, and if it cannot communicate with the Azure service that manages it, it becomes inaccessible. Add a new “Route table” resource and, once it is created, go to the Routes blade and add the route “0.0.0.0/0, next hop Internet”. Without this route, the Managed Instance cannot be deployed.
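
If you would rather script the whole thing, here is a minimal sketch using the Az PowerShell module; the resource group, names and IP ranges below are made-up examples, so adapt them to your environment:

# Minimal sketch (Az PowerShell module) -- all names and ranges are examples
# Assumes the resource group already exists (New-AzResourceGroup -Name $rg -Location $loc)
$rg  = 'rg-sqlmi'
$loc = 'canadacentral'

# Route table with the mandatory 0.0.0.0/0 -> Internet route
$rt = New-AzRouteTable -Name 'rt-sqlmi' -ResourceGroupName $rg -Location $loc
Add-AzRouteConfig -RouteTable $rt -Name 'ToAzureManagement' `
    -AddressPrefix '0.0.0.0/0' -NextHopType Internet | Set-AzRouteTable

# Dedicated /27 subnet (32 addresses, within the 16-256 rule), no NSG, no service endpoints
$subnet = New-AzVirtualNetworkSubnetConfig -Name 'sn-sqlmi' `
    -AddressPrefix '10.0.1.0/27' -RouteTable $rt
New-AzVirtualNetwork -Name 'vnet-sqlmi' -ResourceGroupName $rg `
    -Location $loc -AddressPrefix '10.0.0.0/16' -Subnet $subnet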

 

Rules for the subnet

  • The Managed Instance route table is assigned to the subnet.
  • There should be no Network Security Groups in your subnet.
  • There should be no service endpoints in your subnet.
  • There are no other resources in the subnet.

 

Altogether:

  • Virtual Network should have service endpoints disabled.
  • Subnet must have between 16 and 256 IP addresses (masks from /28 to /24).
  • There should be no other resources in your Managed Instance subnet.
  • Subnet must have a route with 0.0.0.0/0, next hop Internet.
  • Subnet must not have any Network Security Group.
  • Subnet must not have any service endpoint.

 

More on configuring your SQL MI on MSDN (or whatever the new name is :) ): https://blogs.msdn.microsoft.com/sqlserverstorageengine/2018/03/14/how-to-configure-network-for-azure-sql-managed-instance/

 

Access and restores

And then what, you ask? You need to connect and play with the database stuff?

SQL MIs are private by default, and the easy way in is to connect from a VM within the same VNet using SSMS, or from your app running next to the SQL MI – the usual architecture stuff, right?

But wait, there are more scenarios here! https://docs.microsoft.com/en-us/azure/sql-database/sql-database-managed-instance-connect-app
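
As a quick sanity check from a VM inside the VNet, something like this should answer (a sketch assuming the SqlServer PowerShell module; the hostname and login are hypothetical):

# Hypothetical MI FQDN and SQL login -- replace with your own
Invoke-Sqlcmd -ServerInstance 'my-mi.abcdef123456.database.windows.net' `
    -Username 'sqladmin' -Password '<password>' -Query 'SELECT @@VERSION'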

 

[screenshot: the version number reported by the SQL Managed Instance]

 

Don’t be fooled by the version number here: this is not your regular v12 MSSQL (aka SQL Server 2014). The Azure flavours of MSSQL follow a different numbering and are actually more like a mixture of 2017 and 2019 (as of today!).

And if you thought you could restore a DB from a backup (.bak) file, you can – but it will have to be from a blob storage container of some sort, as SQL MI only understands that device type.
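
For illustration, a minimal sketch of such a URL-based restore, again through Invoke-Sqlcmd; the storage account, container, SAS token and database name are all hypothetical:

$sql = @"
-- hypothetical storage URL; the SAS secret goes in without the leading '?'
CREATE CREDENTIAL [https://mystorage.blob.core.windows.net/backups]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE', SECRET = '<sas-token>';

RESTORE DATABASE MyDb
FROM URL = 'https://mystorage.blob.core.windows.net/backups/mydb.bak';
"@
Invoke-Sqlcmd -ServerInstance 'my-mi.abcdef123456.database.windows.net' `
    -Username 'sqladmin' -Password '<password>' -Query $sql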

Or use DMS https://docs.microsoft.com/en-us/azure/dms/tutorial-sql-server-to-managed-instance

 

Performance

Because SQL MI is managed and under an SLA, all databases are fully logged, with log-rate throttling (https://blogs.msdn.microsoft.com/sqlcat/2018/07/20/storage-performance-best-practices-and-considerations-for-azure-sql-db-managed-instance-general-purpose/) to ensure the cluster can ship and replay the logs in a reasonable amount of time. When doing more intensive operations, such as inserting millions of records, this can play a role in slowing things down.

 

The answer to that is: over-allocated storage! Indeed, behind the DB files are blobs with various IOPS capabilities, and just like managed disks, the disk IOPS for your DBs come with the amount of storage. The size tiers are ≤512; 513–1,024; 1,025–2,048 GB… so even if you have a small DB, you might want to go with more space on the instance and grow your DB (and log) files to the top of their tier right away – you pay for it after all! More on premium disks here: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/premium-storage#scalability-and-performance-targets
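
Growing the files up front is then a one-liner per file; a sketch where the logical file names and target sizes are hypothetical (check yours in sys.database_files):

$sql = @"
-- grow data and log files to the top of their size tier; names and sizes are examples
ALTER DATABASE MyDb MODIFY FILE (NAME = 'MyDb',     SIZE = 1024GB);
ALTER DATABASE MyDb MODIFY FILE (NAME = 'MyDb_log', SIZE = 512GB);
"@
Invoke-Sqlcmd -ServerInstance 'my-mi.abcdef123456.database.windows.net' `
    -Username 'sqladmin' -Password '<password>' -Query $sql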

 

Tip! Remember that tempdb is not fully logged and lives on local SSD in the Business Critical version of SQL MI. Use that tier if you want untethered speed.

Also of interest, a script to measure IOPS: https://github.com/dimitri-furman/managed-instance/blob/master/MI-GP-storage-perf/MI-GP-storage-perf.sql

 


How to fix NTP issues on Nutanix CVMs

First, review the following article: https://portal.nutanix.com/#/page/kbs/details?targetId=kA032000000bmjeCAA

If you suspect an offset anyway, run the following. In this example, the time is way off:

allssh grep offset ~/data/logs/genesis.out
================== 173.23.33.12 =================
2018-06-20 08:54:07 INFO time_manager.py:555 NTP offset: -119.154 seconds
2018-06-20 09:04:43 INFO time_manager.py:555 NTP offset: -119.149 seconds
2018-06-20 09:15:14 INFO time_manager.py:555 NTP offset: -119.154 seconds
2018-06-20 09:25:50 INFO time_manager.py:555 NTP offset: -119.145 seconds
2018-06-20 09:36:21 INFO time_manager.py:555 NTP offset: -119.154 seconds

Because of the large offset, Cassandra will not allow the server to immediately jump back to the correct time, as that would mess up timestamps. But fear not, it comes with a script to slowly catch up, “fix_time_drift”, which you can schedule via cron:

allssh '(/usr/bin/crontab -l && echo "*/5 * * * * bash -lc /home/nutanix/serviceability/bin/fix_time_drift") | /usr/bin/crontab -'

Then you can keep an eye on the cluster time offset using:

for i in `svmips` ; do echo CVM $i: ; ssh $i "/usr/sbin/ntpq -pn" ; echo ; done
CVM 173.23.33.12:
FIPS mode initialized
Nutanix Controller VM
remote refid st t when poll reach delay offset jitter
==============================================================================
*173.23.33.1 13.65.245.138 3 u 23 256 377 0.513 23.415 21.630
127.127.1.0 .LOCL. 10 l 107m 64 0 0.000 0.000 0.000

CVM 173.23.33.14:
FIPS mode initialized
Nutanix Controller VM
remote refid st t when poll reach delay offset jitter
==============================================================================
*173.23.33.12 173.23.33.1 4 u 135 256 377 0.226 6.950 6.836

CVM 173.23.33.16:
FIPS mode initialized
Nutanix Controller VM
remote refid st t when poll reach delay offset jitter
==============================================================================
*173.23.33.12 173.23.33.1 4 u 30 256 377 0.240 -2.570 6.010

or with the previously mentioned command:

allssh grep offset ~/data/logs/genesis.out
================== 173.23.33.12 =================
2018-06-21 09:41:16 INFO time_manager.py:555 NTP offset: -118.121 seconds
2018-06-21 09:40:22 INFO time_manager.py:555 NTP offset: -0.000 seconds
2018-06-21 09:50:52 INFO time_manager.py:555 NTP offset: 0.005 seconds
2018-06-21 10:01:22 INFO time_manager.py:555 NTP offset: 0.006 seconds

When all caught up, run the NTP health check:

ncc health_checks network_checks check_ntp

Also, once all is clear, don’t forget to remove the fix_time_drift crontab job!

allssh "(/usr/bin/crontab -l | sed '/fix_time_drift/d' | /usr/bin/crontab -)"

Azure Cloud, IOPS, DTU and VM equivalence, PaaS vs IaaS MSSQL

I am building the next generation of data warehousing for a client, and just because the sky is the limit in the cloud doesn’t mean they can afford to spend without limits. Millions of records (25 million) need to be analyzed to produce meaningful reports and dashboards.

Cranking the power all the way up gives us the information we need in seconds. The highest database tier offered is something called P11. But what is behind P11? Did I say this PaaS option was expensive?

First, it is not readily translatable into regular machines, as the various tiers use a mix of compute, memory and disk IOPS; as per Microsoft, it is “a blended measure of CPU, memory, and data I/O and transaction log I/O…”. This blended unit is the Database Transaction Unit, aka DTU.

Based on performance trending, it comes to something like this (courtesy of “What the heck is a DTU?”):

Number Cores            | IOPS        | Memory | DTUs | Service Tier   | Comparable Azure VM Size
1 core, 5% utilization  | 10          | ???    | 5    | Basic          | Standard_A0, barely used
<1 core                 | 150         | ???    | 100  | Standard S0-S3 | Standard_A0, not fully utilized
1 core                  | up to 4000  | ???    | 500  | Premium – P4   | Standard_DS1_v2
2-3 cores               | up to 12000 | ???    | 1000 | Premium – P6   | Standard_DS3_v2
4-5 cores               | up to 20000 | ???    | 1750 | Premium – P11  | Standard_DS4_v2

Notice that this is a comparable size… and that P11 is about CAD$15,000/month!!! And while PaaS is great, the price tag is a little too much for the common mortals getting into DWH.

Anyhow, trying to recreate a cheaper P11, I went and picked the E8s_v3 VM size, because it said “Great for relational database servers, medium to large caches, and in-memory analytics”. More sizes here: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes

Size            | vCPU | Memory (GiB) | Temp storage (SSD) GiB | Max data disks | Max cached and temp storage throughput: IOPS / MBps (cache size in GiB) | Max uncached disk throughput: IOPS / MBps | Max NICs / Expected network bandwidth (Mbps)
Standard_E8s_v3 | 8    | 64           | 128                    | 16             | 16,000 / 128 (200)                                                      | 12,800 / 192                              | 4 / 4,000

And so, to host my data, I am adding the best disk I can find on Azure:

This one, because it is closest to the size I need:
A P20 disk is 512 GB; it can achieve 2,300 IOPS and up to 150 MB/s.

When I could have gone for the highest:
A P30 disk is 1,024 GB; it can achieve 5,000 IOPS and up to 200 MB/s – and it only comes in the 1 TB size.

The key thing* to understand is that the VM tier will limit the IOPS and throughput of whatever you attach to it.

And so here, while my disk is rated at 2,300 IOPS and 150 MB/s, the machine specifications are going to limit me to 16,000 IOPS (no problem there) but only 128 MB/s, which is OK since the P20 only does 150 MB/s anyway.

Once the VM is brought online and the disk attached, a quick benchmark matches the given specifications:

[screenshot: disk benchmark on the Azure VM]

 

 

135 MB/s for 150 MB/s advertised…

But wait, is this fast? My laptop gives me way more:

[screenshot: disk benchmark on my laptop]

How to attach faster disks to my VMs, then? What if I want to create a VHD with 600 GB and 400 MB/s of throughput?

You will not obtain such throughput if you just create a 600 GB VHD, because Azure will create a 600 GB VHD on a P30 Disk, and then you will have only 200 MB/s.

To achieve that, you should use striping, and you can proceed in different ways:

  • Create 2 × 600 GB VHDs. Azure will create them using P30 disks. Then you use your striping tool (Storage Spaces) to create a 1,200 GB volume. This volume will permit 400 MB/s and 10,000 IOPS. But in this case, you will have 600 unneeded GB.
  • Create 3 VHDs with 200 GB each. Azure will create them using P20 disks. Then you use your striping tool (Storage Spaces) to create a 600 GB volume. This volume will permit 450 MB/s (150 MB/s × 3) and 6,900 IOPS (2,300 IOPS × 3) – see the sketch just after this list.
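
For the second option, a minimal Storage Spaces sketch, assuming the three data disks are already attached to the VM and unallocated (pool and volume names are made up):

# Pool the three attached data disks and stripe across all of them (Simple = striping, no redundancy)
$disks = Get-PhysicalDisk -CanPool $true
New-StoragePool -FriendlyName 'DataPool' -StorageSubSystemFriendlyName 'Windows Storage*' -PhysicalDisks $disks
New-VirtualDisk -StoragePoolFriendlyName 'DataPool' -FriendlyName 'Striped' `
    -ResiliencySettingName Simple -NumberOfColumns 3 -UseMaximumSize

# Initialize, partition and format the striped volume
Get-VirtualDisk -FriendlyName 'Striped' | Get-Disk |
    Initialize-Disk -PartitionStyle GPT -PassThru |
    New-Partition -AssignDriveLetter -UseMaximumSize |
    Format-Volume -FileSystem NTFS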

Wait, I need more! What if I want to create a VHD with 600 GB and 600 MB/s of throughput? Unfortunately, we can’t just dream and ask Azure to do it, not as of today. In fact, the maximum throughput possible is 512 MB/s; we can’t do better.

OK. You have created the striped volumes and you are still not getting what you want? Remember the key thing I mentioned above*? The total data storage, the IOPS and the throughput are all limited by the VM series and size. Each Azure Virtual Machine type is limited in its number of disks (total storage size), maximum IOPS and maximum throughput. For example, you may achieve 128 MB/s only on a Standard_E8s_v3; other VM types will throttle your IOPS or throughput when you reach their threshold.

While I was looking at the memory-optimized server sizes, it turns out there are also storage-optimized VM sizes:

Size            | vCPU | Memory (GiB) | Temp storage (SSD) GiB | Max data disks | Max cached and temp storage throughput: IOPS / MBps (cache size in GiB) | Max uncached disk throughput: IOPS / MBps | Max NICs / Expected network bandwidth (Mbps)
Standard_E8s_v3 | 8    | 64           | 128                    | 16             | 16,000 / 128 (200)                                                      | 12,800 / 192                              | 4 / 4,000
Standard_L8s    | 8    | 64           | 32                     | 32             | 40,000 / 400                                                            | 10,000 / 250                              | 4 / 8,000

And so, beyond the disk creation steps above, you will also want to pick a VM size that can actually deliver the throughput you need.

#azure, #database, #dtu, #iaas, #mssql, #paas, #sql

IBM v7000 WebUI not responding

I ran into a case of a non-responding WebUI for managing LUNs (and other things) on an IBM Storwize v7000.
Note that you can only run satask commands if you connect to the CLI using the SSH private key associated with the user called superuser. No other SSH key will allow you to run satask commands:

IBM_Storwize:SAN1:superuser>sainfo lsservicenodes
panel_name cluster_id cluster_name node_id node_name relation node_status error_data
01-1 00000100204029A2 SAN1 1 node1 local Active
01-2 00000100204029A2 SAN1 2 node2 partner Active

To find out which node is the configuration node, run the following command:

sainfo lsservicestatus <Active panel name>

The following output tells you which node is the config node:

IBM_Storwize:SAN1:superuser>sainfo lsservicestatus 01-1
panel_name 01-1
cluster_id 00000100204029a2
cluster_name SAN1
cluster_status Active
cluster_ip_count 2
cluster_port 1
cluster_ip 172.xx.xx.33
cluster_gw 172.xx.xx.252
cluster_mask 255.255.254.0

Use the following command to restart the web service on the configuration node:

satask restartservice -service tomcat 01-1

Wait at least 5 minutes for the service to restart before assuming that this has failed.

If this does not solve your problem, you should restart the canister with the configuration node on it. Remember to connect to the node that is not the configuration node, and restart the “partner”. If you are unable to reset it that way, reset the canister physically: pull it out a few inches from the chassis, then insert it again after 30 seconds.

#ibm, #storage-2, #v7000, #webui

After a migration to Exchange Online Exchange Autodiscover SCP got in the way

There was once a customer who was on Exchange 2013 on-prem for their email needs. At some point, they decided (thanks to Microsoft’s enticing pricing) to go with Office 365 and migrate all of their mailboxes and email needs to the cloud.

After the Exchange servers were taken out of the equation, users and new Outlook setups started seeing some uninteresting error/warning messages:

[screenshot: Outlook certificate mismatch warning]

While moving to Office 365, all of the DNS entries (internal and external) for autodiscover had been changed to autodiscover.outlook.com, but the old one – autodiscover.contoso.com, in theory the on-prem Exchange CAS – would still be around and generate the message. So where would this on-prem reference come from?

While on-prem, Exchange had been configured to add entries to Active Directory, here in the form of a Service Connection Point (SCP) object; as domain-joined machines try to auto-configure Outlook, they end up finding that older reference.

[diagram: Autodiscover functional process]

The solution is to remove this from AD! There are multiple ways to prevent Outlook from contacting the local Exchange server first…

Using Exchange Management Shell (EMS)
The preferred way is to use the Exchange Management Shell to clear the entry for the Client Access server from the SCP.

[PS] C:\Windows\system32>Get-ClientAccessServer | fl *uri*
AutoDiscoverServiceInternalUri : https://webmail.contoso.com/autodiscover/autodiscover.xml
AutoDiscoverServiceInternalUri : https://webmail.contoso.com/autodiscover/autodiscover.xml
[PS] C:\Windows\system32>Set-ClientAccessServer -Identity cas-ex1 -AutoDiscoverServiceInternalUri $null
[PS] C:\Windows\system32>Set-ClientAccessServer -Identity cas-ex2 -AutoDiscoverServiceInternalUri $null

ADSIEdit
If the above method can no longer be used, a low-level AD editor such as ADSIEdit can be used to remove the SCP manually. The full path of the SCP is:

CN=ServerName,CN=Autodiscover,CN=Protocols,CN=ServerName,CN=Servers,CN=Exchange Administrative Group (FGH124FG788IYF),CN=Administrative Groups,CN=OrganizationName,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=DomainName,DC=Suffix


The object to remove has the class type serviceConnectionPoint.
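
To double-check what is still published before (or after) the cleanup, here is a hedged sketch using the ActiveDirectory PowerShell module; it simply lists SCP objects sitting under an Autodiscover container in the configuration partition:

# List Autodiscover SCP objects in the configuration partition
Import-Module ActiveDirectory
$cfg = (Get-ADRootDSE).configurationNamingContext
Get-ADObject -SearchBase $cfg -LDAPFilter '(objectClass=serviceConnectionPoint)' `
    -Properties serviceBindingInformation |
    Where-Object { $_.DistinguishedName -like '*CN=Autodiscover*' } |
    Select-Object DistinguishedName, serviceBindingInformation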

Some useful tools to help find autodiscover issues:
SaRA (Microsoft Support and Recovery Assistant): https://diagnostics.office.com
Some reference from TechNet (yes, I still call it TechNet): https://technet.microsoft.com/en-us/library/bb124251.aspx
More about SCPs: https://msdn.microsoft.com/en-us/library/office/dn467397(v=exchg.150).aspx

#exchange, #office-365, #outlook

Netflix and OpenDNS are not friends – Netflix support sucks

For some reason I had not opted to let the Netflix app on my Android TV update itself, and one day, with an Android update, it decided it was going to use 4.0.4 build 1716 instead.

Then Netflix would not load any video, getting stuck at around 25% with the lovely error tvq-pm-100 (3.1-52).

Life after work ended. The evening entertainment was ruined forever and I was not the same again. The downward spiral was inevitable.

I chatted with Netflix and spent hours on the phone going through the meaningless scripted troubleshooting – had I restarted my TV box? Logged off and back on? Cleared the cache? Reset the appliance? Nothing. I was on the verge of video deprivation.

The most intriguing aspect was that the competent Netflix staff would say: well, as it is not us, it must be your network provider. Yet they were not able to say what getting stuck at 25% could mean. Where are the good old logs telling you what is going on when you need them?

I then read on a forum that the Netflix Android TV app would rely on Google DNS to geo-triangulate you and spy on you.

In order to protect my household, I had opted long ago for OpenDNS to block DoubleClick and the other webspam of the universe, without any issues in the previous versions of Netflix.

In the end, changing the DNS settings on that Android TV to use Google’s infamous DNS servers 8.8.8.8 and 8.8.4.4 got Netflix videos loading at lightning speed, and that very same Android TV box could again spy on me at will.

Thanks to Google’s sneakiness the end of the world was avoided.

#google, #netflix, #netflix-sucks, #opendns, #privacy, #support-sucks

Copy windows features from a server to another

Sometimes you want to create an (almost) identical server somewhere you do not yet have Chef or CF or some sort of DSC. The best resort is to use what you have: Get-WindowsFeature.

Imagine you want to configure Server B from Server A, and obviously those are Microsoft Windows servers…

#On server A
#export features
> Get-WindowsFeature | ? { $_.Installed -AND $_.SubFeatures.Count -eq 0 } | Export-Clixml .\serverA.xml
#copy the feature file over
> cp .\serverA.xml '\\serverB\c$\Files'

#On server B
PS C:\Files> ls

    Directory: C:\Files

Mode          LastWriteTime  Length Name
----          -------------  ------ ----
-a----  6/8/2017   4:15 PM  510824 ServerA.xml

PS C:\Files> Import-Module Servermanager
PS C:\Files> Import-Clixml .\ServerA.xml | Add-WindowsFeature

Success Restart Needed Exit Code      Feature Result
------- -------------- ---------      --------------
True    Yes            SuccessRest... {Application Server, .NET Framework 4.5, W...

WARNING: You must restart this server to finish the installation process.

Voila, another posh timesaver.

#clone, #get-windowsfeature, #powershell, #server, #windows-2