Kevin Holman’s System Center Blog

Recommended registry tweaks for SCOM 2016 management servers


 


I will start with what people want most – the “list”:

 

These are the most common changes and settings I recommend to adjust on SCOM management servers. 

Simply run these from an elevated command prompt on all your management servers.

 

reg add "HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters" /v "State Queue Items" /t REG_DWORD /d 20480 /f reg add "HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters" /v "Persistence Checkpoint Depth Maximum" /t REG_DWORD /d 104857600 /f reg add "HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL" /v "DALInitiateClearPool" /t REG_DWORD /d 1 /f reg add "HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL" /v "DALInitiateClearPoolSeconds" /t REG_DWORD /d 60 /f reg add "HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0" /v "GroupCalcPollingIntervalMilliseconds" /t REG_DWORD /d 900000 /f reg add "HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse" /v "Command Timeout Seconds" /t REG_DWORD /d 1800 /f reg add "HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse" /v "Deployment Command Timeout Seconds" /t REG_DWORD /d 86400 /f

 

I will explain each setting in detail below:

 

1.  HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\
REG_DWORD Decimal Value:        State Queue Items = 20480

SCOM 2016 default existing registry value:   (not present) 

SCOM 2016 default value in code:   10240

Description:  This sets the maximum size of healthservice internal state queue.  It should be equal or larger than the number of monitor based workflows running in a healthservice.  Too small of a value, or too many workflows will cause state change loss.  http://blogs.msdn.com/b/rslaten/archive/2008/08/27/event-5206.aspx

 

2.  HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\
REG_DWORD Decimal Value:  Persistence Checkpoint Depth Maximum = 104857600

SCOM 2016 default existing registry value = 20971520

Description:  This is most useful for Management Servers that host a large number of agentless objects (network devices, URLs, Linux, 3rd party, Veeam), which results in the MS running a large number of workflows.  This is an ESE DB setting which controls how often ESE writes to disk.  A larger value will decrease disk I/O caused by the SCOM healthservice, but increase ESE recovery time in the case of a healthservice crash.

 

3.  HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL\
REG_DWORD Decimal Value:
  DALInitiateClearPool = 1
  DALInitiateClearPoolSeconds = 60

SCOM 2016 existing registry value:   not present

Description:  This is a critical setting on ALL management servers in ANY management group.  This setting configures the SDK service to attempt a reconnection to SQL server upon disconnection, on a regular basis.  Without these settings, an extended SQL outage can cause a management server to never reconnect back to SQL when SQL comes back online after an outage.   Per:  http://support.microsoft.com/kb/2913046/en-us  All management servers in a management group should get the registry change.

 

4.  HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\
REG_DWORD Decimal Value:       GroupCalcPollingIntervalMilliseconds = 900000

SCOM 2016 existing registry value:  (not present)

SCOM 2016 default code value:  30000 (30 seconds)

Description:  This setting will slow down how often group calculation runs to find changes in group memberships.  Group calculation can be very expensive, especially with a large number of groups, large agent count, or complex group membership expressions.  Slowing this down will help keep groupcalc from consuming all the healthservice and database I/O.  900000 is every 15 minutes.

 

5.  HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse\
REG_DWORD Decimal Value:    Command Timeout Seconds = 1800

SCOM 2016 existing registry value:  (not present)

SCOM 2016 default code value:  600

Description:  This helps with dataset maintenance as the default timeout of 10 minutes is often too short.  Setting this to a longer value helps reduce the 31552 events you might see with standard database maintenance.  This is a very common issue.   http://blogs.technet.com/b/kevinholman/archive/2010/08/30/the-31552-event-or-why-is-my-data-warehouse-server-consuming-so-much-cpu.aspx  This should be adjusted to however long it takes aggregations or other maintenance to run in your environment.  We need this to complete in less than one hour, so if it takes more than 30 minutes to complete, you really need to investigate why it is so slow, either from too much data or SQL performance issues.

 

6.  HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse\
REG_DWORD Decimal Value:    Deployment Command Timeout Seconds = 86400

SCOM 2016 existing registry value:  (not present)

SCOM 2016 default code value:  10800 (3 hours)

Description:  This helps with deployment of heavy-handed scripts that are applied during version upgrades and cumulative updates.  Customers often see blocking on the DW database while indexes are created, which keeps the script from being deployed within the default of 3 hours.  Setting this value to allow a full day to deploy the script resolves most customer issues, and helps reduce the 31552 events you might see with standard database maintenance after a version upgrade or UR deployment.  This is a very common issue in large environments or with very large warehouse databases.

 

 

Ok, that covers the “standard” stuff.

 

I will cover one other registry modification that is RARELY needed.  You should ONLY change this one if directed to by Microsoft support.

WARNING:

If you make changes to this setting, the same change must be made on ALL management servers, otherwise the resource pools will constantly fail.  All management servers must have identical settings here.  If you add a management server in the future, this setting must be applied immediately if you modified it on other management servers, or you will see your resource pools constantly committing suicide and failing over to other management servers, reinitializing all workflows in a loop.   All the other settings in this article are generally beneficial.  This specific one for PoolManager should receive great scrutiny before changing, due to the risks.  It is NOT included in my reg-add list above for good reason.

 

HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\PoolManager\
REG_DWORD Decimal Values:
  PoolLeaseRequestPeriodSeconds = 600
  PoolNetworkLatencySeconds = 120

SCOM 2016 existing registry value:  not present (you must create the PoolManager key and both values).  Default code values:  120 and 30 seconds, respectively.

This is VERY RARE to change, and in general I only recommend changing this under advisement from a support case.  The resource pools work quite well on their own, and I have worked with very large environments that did not need these to be modified.  This is more common when you are dealing with a rare condition, such as management group spread across datacenters with high latency links, DR sites, MASSIVE number of workflows running on management servers, etc.


Management Pack authoring the REALLY fast and easy way, using Silect MP Author and Fragments


 

 


Silect MP Author Professional just added support for Visual Studio fragments.  If you didn’t get to attend the webinar on this – here is the recording.

MP Authoring just got really easy, and really FAST.  Check out the video and see how using MP fragments can take your SCOM environment to a whole new level.

 

 

Link to recording:  https://youtu.be/E5nnuvPikFw

Silect MP Author Pro:  http://www.silect.com/mp-author-professional/

Kevin Holman’s Fragment Library:  https://gallery.technet.microsoft.com/SCOM-Management-Pack-VSAE-2c506737

Using fragments in Visual Studio (Previous session) recording:  https://youtu.be/9CpUrT983Gc

Enable proxy as a default setting in SCOM 2016


 


 

The default setting for new SCOM agents is that Agent Proxy is disabled.  You can enable this agent by agent, or for specific agents with script automations.  I find this to be a clumsy task, and more and more management packs require this capability to be enabled, like Active Directory, SharePoint, Exchange, Clustering, Skype, etc.  At some point, it makes a lot more sense to just enable this as a default setting, and that is what I advise my customers.

Set it, and forget it.  One of the FIRST things I do after installing SCOM.

(This also works just fine and exactly the same way in SCOM 2012, 2012 SP1, and 2012R2.)

 

On a SCOM management server:  Open up any PowerShell session (SCOM shell or regular old PowerShell)

add-pssnapin "Microsoft.EnterpriseManagement.OperationsManager.Client"
new-managementGroupConnection -ConnectionString:localhost
set-location "OperationsManagerMonitoring::"
Set-DefaultSetting -Name HealthService\ProxyingEnabled -Value True

If you want to use this remotely – change “localhost” above to the FQDN of your SCOM server.

 

In order to inspect this setting, you can run:

add-pssnapin "Microsoft.EnterpriseManagement.OperationsManager.Client"
new-managementGroupConnection -ConnectionString:localhost
set-location "OperationsManagerMonitoring::"
Get-DefaultSetting
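
Note that changing the default setting only affects agents installed after the change.  For agents already deployed, here is a sketch using the OperationsManager module cmdlets (run on a management server):

Import-Module OperationsManager
New-SCOMManagementGroupConnection -ComputerName localhost
Get-SCOMAgent | Enable-SCOMAgentProxy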

Agent Management Pack – Making a SCOM Admin’s life a little easier


 

This is a little example MP for some things that are possible with SCOM.  It also serves as a good example MP on how to write classes, discoveries, and most importantly many task examples for command line, VBscript, and PowerShell.

I didn’t write all these – a bunch of ideas came from Jimmy Harper, Matt Taylor, and Tim McFadden and their MP’s.  This was more to combine lots of useful administration in one place.

 

First – useful discovered properties:

 


 

  • The “real” agent version
  • The UR level of the agent
  • Any Management Groups that the agent belongs to.  This is nice for spotting old management groups that get left behind.
  • A check for whether PowerShell is installed, and what version.  This is important because PowerShell 2.0 is required on all agents if you want to move to SCOM 2016.
  • CLR .NET runtime version available to PowerShell
  • OS Version and Name
  • Primary and Failover management servers.  I am getting this straight from the agent’s config XML file; sometimes agents might not be configured as you think, and this is the authoritative source – what’s in that specific agent’s config.
  • Lastly, the default Agent Action account.  Helpful to find any agents where someone installed incorrectly.

 

Next up – the tasks:

 


 

One of the problems with tasks is that they are scoped to a specific class.  Some cool tasks are attached to Windows Computer, some to HealthService, some to specific app classes.  Or people write tasks and scope them to System.Entity, which places the task in ALL views.  That’s handy, but if everyone did that we’d have an unusable console for tasks.

Computer Management – duh.

 

Create Test Event – this task creates event 100 with source TEST in the app event log, and there is a rule in the MP to generate an info alert.  This will let you test end to end agent function, and notifications.
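
If you want to reproduce what the task does by hand (a sketch of the equivalent – not the MP’s actual code), you can create the source and event from an elevated Windows PowerShell prompt:

New-EventLog -LogName Application -Source TEST -ErrorAction SilentlyContinue   # registers the TEST source if it does not already exist
Write-EventLog -LogName Application -Source TEST -EventId 100 -EntryType Information -Message "SCOM test event"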


 

Execute any PowerShell – this task accepts one parameter, “ScriptBody”, which allows you to pass any PowerShell statements; they will execute locally on the agent and return output:


 

Execute any Service Restart – this will take a servicename as a parameter and restart the service on any agent on demand.  You should NOT use this for the Healthservice – there is a special task for that:


 

Execute any Software from Share – this task will accept an executable (or a command line including an executable) and a share path which contains the software, and it will run it locally on the agent.  This is useful to install missing UR updates, or any other software you want deployed.  This will require that “Domain Computers” have read access to the files on the share.


 

Export Event Log – this task will export any local event log and save the export to a share.  It will require that the “Domain Computers” have write access to the share.


 

HealthService – Flush – This task will stop the agent service, delete the health service store, cache, and config, and start the service back up, provoking a complete refresh of the agents config, management packs, and ESE database store.

 

HealthService – Restart – This is a special task which will reliably bounce the HealthService on agents using an “out of band” script process.  Many scripts to bounce the agent service fail because when the service stops, the script to start it back up is destroyed from memory.

 

Management Group – ADD and Management Group – REMOVE – these are script based tasks to add or remove a management group from an agent.

Ping – (Console Task) – Duh

Remote Desktop – (Console Task) – Duh

 

Do you have other useful agent management tasks that you think should be in a pack like this?  Or discovered properties that are useful as well?  I welcome your feedback.

 

 

Additionally – I have created two versions of this MP.  One with everything above, and one without the “risky” tasks, like exposing the ability to execute any PowerShell, restart any service, and install any software from a share.  If those are things you don’t ever want exposed in your SCOM environment – import the other MP.  You can control who sees which tasks, but by default operators will see tasks.

 

 

Download the MP here:  https://gallery.technet.microsoft.com/SCOM-Agent-Management-b96680d5

How does CPU monitoring work in the Windows Server 2016 management pack?


 


 

First – let me warn you.  The way SCOM monitors Processor time is *incredibly* complicated.  If you don’t like it – there is *NOTHING* wrong with nuking this from orbit (disable via override) and just create your own very simple consecutive samples (or average) monitor.  That said, while complicated and difficult to understand, it is very powerful and useful, and limits “noise”.

 

Ok, all warnings aside – let’s figure out how this works.

 

In the Windows Server 2016 OS Management Pack, there is a built in monitor which evaluates the Processor load.  This monitor (Total CPU Utilization Percentage or Microsoft.Windows.Server.10.0.OperatingSystem.TotalCPUUtilization) targets the “Windows Server 2016 Operating System” class.

It runs every 15 minutes, and evaluates after 3 samples.  The samples are not consecutive samples as the product knowledge states – they are AVERAGE samples. 

Like previous versions of the CPU monitor, this is often misunderstood.  This monitor does not use a native perfmon module, it runs a PowerShell script.  The script evaluates TWO DIFFERENT perfmon counters:

Processor Information / % Processor Time / _Total  (default threshold 95)

System / Processor Queue Length (default threshold 15)

 

BOTH of these above thresholds must be met before we will create a monitor state change/alert.  This means that even if your server is stuck at 100% CPU utilization, it will not generate an alert most of the time.

The default threshold of “15” is multiplied by the number of logical CPU’s for the server.  So on a typical VM with 4 virtual CPU’s, this means that the value of SYSTEM\Processor Queue Length must be greater than (15*4) = 60.  Not only that, but the value must be above 60 for the average of any three consecutive samples.  This is incredibly high.
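
If you want to see what the monitor is evaluating, you can sample the same two counters yourself.  This is a rough sketch of the logic only – not the MP’s actual script:

# Average 3 samples of each counter, mirroring the monitor's evaluation
$cpu   = (Get-Counter '\Processor Information(_Total)\% Processor Time' -SampleInterval 5 -MaxSamples 3).CounterSamples.CookedValue | Measure-Object -Average
$queue = (Get-Counter '\System\Processor Queue Length' -SampleInterval 5 -MaxSamples 3).CounterSamples.CookedValue | Measure-Object -Average
$logicalCPUs = (Get-CimInstance Win32_ComputerSystem).NumberOfLogicalProcessors
# BOTH conditions must be true before the monitor changes state
($cpu.Average -gt 95) -and ($queue.Average -gt (15 * $logicalCPUs))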

What this means, is that it is VERY unlikely this monitor will ever trigger, unless your system is absolutely HAMMERED.  If you like this, great!  If you don’t like this, then you have two options. 

1)  Re-write your own monitor and make it a very simple consecutive or average samples threshold performance monitor.

2)  Override the default monitor – but set the “CPU Queue Length” threshold to “zero”.

This will result in the equation ignoring the CPU queue length requirement, and make the monitor consider “% Processor Time” only.  If you find this is too noisy, you can still use the CPU queue length, but with a lower value than the default of 15.  Another thing to keep in mind: this is a PowerShell script based monitor, so if you want to run it VERY frequently (the default is every 15 minutes), consider replacing it with a less impactful native perfmon based monitor.

The default monitor has a recovery on it – that will output the top consuming processes to the health explorer state change context.

Note – the numbers are not exactly correct – my “ProcessorHog” process was consuming 100% of the CPU…. but this server has 32 cores, so it looks like you need to multiply by the number of cores to understand the ACTUAL utilization consumed by a process.  This is a typical Windows problem in how Windows reports processes, not a SCOM issue.

 

 

 

Ok – so that covers the basic monitoring of the CPU, from an _Total perspective.

 

What about monitoring individual *logical processors* like virtual CPU’s or actual cores on physical servers?  Can we do that?

Yes, yes we can. 

First – let me start by saying – I DON’T recommend you do this.  In fact, I recommend AGAINST this.  This type of monitoring is INCREDIBLY detailed, and creates a huge instance space in SCOM that will only serve to slow down your environment, console, and increase config and monitoring load.  It should only be leveraged where you have a very specific need to monitor individual logical processing cores for very specific reasons, which should be rare.

There is a VERY specific scenario where this type of monitoring might be useful…. that is when an individual single threaded process “runs away” on CPU 0, core 0.  This has been seen on Skype servers and will impact server performance.  So if you MUST monitor for this condition, you can consider discovering these individual CPU’s.  I still don’t recommend it and certainly not across the board.

 

Ok, all warnings aside – let’s figure out how this works.

There is an optional discovery (disabled by default) in the Windows Server 2016 Operating System (Discovery) management pack, to discover individual CPU’s:  “Discover Windows CPUs” (Microsoft.Windows.Server.10.0.CPU.Discovery)  This discovery runs once a day, and calls the Microsoft.Windows.Server.10.0.CPUDiscovery.ModuleType datasource.  This datasource runs a PowerShell script that discovers two object types:

1.  Microsoft.Windows.Server.10.0.Processor (Windows Server 2016 Processor)

2.  Microsoft.Windows.Server.10.0.LogicalProcessor (Windows Server 2016 Logical Processor)

If you enable this discovery – you will discover both types:

 

Let’s start with “Windows Server 2016 Processor”.  This class discovers actual physical or virtual Processors in sockets, as they are exposed to the OS by physical hardware or the virtualization layer.


 

By contrast – the “Windows Server 2016 Logical Processor” class shows instances of physical or virtual “Logical Processors”, which will be virtual processors on a VM, and logical CPU’s exposed to the physical layer – either actual cores or hyper-threaded cores.


 

The former is how all our previous monitoring worked for individual CPU monitoring, which is pretty much worthless.  If we need to monitor cores, we generally don’t care about “sockets”.

The latter is new for Windows Server 2016 management pack, which actually discovers individual logical CPU’s as seen by the OS.

 

Now – let’s look at the monitoring provided out of the box.

If you enable the individual CPU discovery, there are three monitors targeting the “Windows Server 2016 Processor” class, one of which is enabled out of the box:  “CPU Percentage Utilization”.  It runs every three minutes, evaluates 5 samples, with a threshold of “10”.  It is also a PowerShell script based monitor.

Comments on above:

1.  Monitoring for individual “socket” utilization seems really silly to me, and not useful at all.  You probably should not use this.

2.  The default threshold of “10” is WAY too low…. I have no idea why we would use that. 

3.  The counter uses the “Processor” perfmon object instead of the newer “Processor Information”.  The reason this isn’t a simple change is that the “Performance Monitor Instance Name” class property doesn’t match the newer counter’s instance value.

Additionally, there are three rules to collect perfmon data – one of which is enabled.  You should disable this collection rule as well, IF you just HAVE to discover individual CPU’s.

 

Ok, now let’s move on to the Windows Server 2016 Logical Processor.

This is more useful as it will monitor individual CORE’s (or virtual CPU’s) to look for runaway single threaded processes.

There are three monitors out of the box targeting this class and NONE of these are enabled by default.

The one for CPU util, Microsoft.Windows.Server.10.0.LogicalProcessor.CPUUtilization is a native perfmon monitor for consecutive samples.  I like this WAY better than complicated and heavy handed script based monitors.

HOWEVER – this will potentially be VERY noisy – as a server will have multiple CPU’s, and these will alarm anytime the _Total condition is met.  This means duplication of alerts when a server is heavily utilized.  That said – if only a SINGLE logical processor is spiked, but the overall CPU utilization is low, this will let you know that is happening.

 

 

 

Bottom line:

1.  CPU monitoring at the OS level is complex, script based, and uses multiple perf counters before it triggers.  Be aware, and be proactive in managing this.

2.  The individual CPU’s can be discovered, but I DON’T recommend it as a general rule.

3.  The default rules and monitors enabled for individual CPU monitoring focus on SOCKETS, aren’t very useful, and should be disabled.

4.  The new Logical Processor class for the Server 2016 MP is more useful as it monitors cores/logical CPU’s, but all monitoring is disabled by default.

UR13 for SCOM 2012 R2 – Step by Step


KB Article for OpsMgr:  https://support.microsoft.com/en-us/help/4016125

Download catalog site:  http://www.catalog.update.microsoft.com/Search.aspx?q=4016125

Updated UNIX/Linux Management Packs:  https://www.microsoft.com/en-us/download/details.aspx?id=29696

 

 

NOTE:  I get this question every time we release an update rollup:   ALL SCOM Update Rollups are CUMULATIVE.  This means you do not need to apply them in order, you can always just apply the latest update.  If you have deployed SCOM 2012R2 and never applied an update rollup – you can go straight to the latest one available.  If you applied an older one (such as UR3) you can always go straight to the latest one!

 

Key Fixes:

  • After you install Update Rollup 11 for System Center 2012 R2 Operations Manager, you cannot access the views and dashboards that are created on the My Workspace tab.
  • When the heartbeat failure monitor is triggered, a “Computer Not Reachable” message is displayed even when the computer is not down.
  • The Get-SCOMOverrideResult PowerShell cmdlet does not return the correct list of effective overrides.
  • When there are thousands of groups in a System Center Operations Manager environment, the cmdlet Get-SCOMGroup -DisplayName “group_name” fails, and you receive the following message:
    • The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information.
  • When you run System Center 2012 R2 Operations Manager in an all-French locale (FRA) environment, the Date column in the Custom Event report appears blank.
  • The Enable deep monitoring using HTTP task in the System Center Operations Manager console does not enable WebSphere deep monitoring on Linux systems.
  • When overriding multiple properties on rules that are created by the Azure Management Pack, duplicate override names are created. This issue causes overrides to be lost.
  • When creating a management pack (MP) on a client that contains a Service Level (SLA) dashboard and Service Level Objects (SLO), the localized names of objects are not displayed properly if the client’s CurrentCulture settings do not match the CurrentUICulture settings. In the case where the localized settings are English English, ENG, or Australian English, ENA, there is an issue when the objects are renamed.
  • This update adds support for OpenSSL1.0.x on AIX computers. With this change, System Center Operations Manager uses OpenSSL 1.0.x as the default minimum version supported on AIX,  and OpenSSL 0.9.x is no longer supported.

 

 
 
 
Let’s get started.

 

From reading the KB article – the order of operations is:

  1. Install the update rollup package on the following server infrastructure:
    • Management servers
    • Audit Collection servers 
    • Gateway servers
    • Web console server role computers
    • Operations console role computers
    • Reporting
  2. Apply SQL scripts.
  3. Manually import the management packs.
  4. Update Agents
  5. Unix/Linux management packs and agent updates.

 

 

1.  Management Servers


It doesn’t matter which management server I start with.  There is no need to begin with whoever holds the “RMSe” role.  I simply make sure I only patch one management server at a time to allow for agent failover without overloading any single management server.

I can apply this update manually via the MSP files, or I can use Windows Update.  I have 2 management servers, so I will demonstrate both.  I will do the first management server manually.  This management server holds 3 roles, and each must be patched:  Management Server, Web Console, and Console.

The first thing I do when I download the updates from the catalog, is copy the cab files for my language to a single location:


Then extract the contents:


 

Once I have the MSP files, I am ready to start applying the update to each server by role.

***Note:  You MUST log on to each server role as a Local Administrator, SCOM Admin, AND your account must also have System Administrator role to the SQL database instances that host your OpsMgr databases.

 

My first server is a Management Server Role, and the Web Console Role, and has the OpsMgr Console installed, so I copy those update files locally, and execute them per the KB, from an elevated command prompt:


 

This launches a quick UI which applies the update.  It will bounce the SCOM services as well.  The update usually does not provide any feedback that it had success or failure. 

You *MAY* be prompted for a reboot.  You can click “No” and do a single reboot after fully patching all roles on this server.

 

You can check the application log for the MsiInstaller events to show completion:

Log Name:      Application
Source:        MsiInstaller
Date:          5/25/2017 9:01:13 AM
Event ID:      1036
Description:
Windows Installer installed an update. Product Name: System Center Operations Manager 2012 Server. Product Version: 7.1.10226.0. Product Language: 1033. Manufacturer: Microsoft Corporation. Update Name: System Center 2012 R2 Operations Manager UR13 Update Patch. Installation success or error status: 0.

 

You can also spot check a couple DLL files for the file version attribute. 
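
A quick way to do that with PowerShell (a sketch – the path assumes a default SCOM 2012 R2 install location):

Get-ChildItem 'C:\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\*.dll' | Select-Object Name, @{Name='FileVersion';Expression={$_.VersionInfo.FileVersion}}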


 

Next up – run the Web Console update:


 

This runs much faster.   A quick file spot check:


 

Lastly – install the console update (make sure your console is closed):


 

A quick file spot check:



 
 
Additional Management Servers:


I now move on to my additional management servers, applying the server update, then the console update and web console update where applicable.

On this next management server, I will use the example of Windows Update as opposed to manually installing the MSP files.  I check online, and make sure that I have configured Windows Update to give me updates for additional products: 


The applicable updates show up under optional – so I tick the boxes and apply these updates.


 

After a reboot – go back and verify the update was a success by spot checking some file versions like we did above.


 
 
 
 
Updating ACS (Audit Collection Services)


You would only need to update ACS if you had installed this optional role.

On any Audit Collection Collector servers, you should run the update included:


A spot check of the files:



 
 
 
Updating Gateways:


 

I can use Windows Update or manual installation.


The update launches a UI and quickly finishes.

You MAY be prompted for a reboot.

 

Then I will spot check the DLL’s:


 

I can also spot-check the \AgentManagement folder, and make sure my agent update files are dropped here correctly:


***NOTE:  You can delete any older UR update files from the \AgentManagement directories.  The UR’s do not clean these up and they provide no purpose for being present any longer.

 

I can also apply the GW update via Windows Update:

 

 

 


Reporting Server Role Update


I kick off the MSP from an elevated command prompt:


 

This runs VERY fast and does not provide any feedback on success or failure.



 
NOTE:  There is an RDL file update available to fix a bug in business hours based reporting.  See the KB article for more details.  You can update this RDL optionally if you use that type of reporting and you feel you are impacted.
 
 
 
2. Apply the SQL Scripts

In the path on your management servers, where you installed/extracted the update, there are two SQL script files: 

%SystemDrive%\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\SQL Script for Update Rollups

(note – your path may vary slightly depending on if you have an upgraded environment or clean install)


First – let’s run the script to update the OperationsManagerDW (Data Warehouse) database.  Open a SQL management studio query window, connect it to your Operations Manager DataWarehouse database, and then open the script file (UR_Datawarehouse.sql).  Make sure it is pointing to your OperationsManagerDW database, then execute the script.

You should run this script with each UR, even if you ran this on a previous UR.  The script body can change so as a best practice always re-run this.

If you see a warning about line endings, choose Yes to continue.
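
If you prefer a command line over SQL Management Studio, the same scripts can be run with sqlcmd (a sketch – substitute your own SQL server and instance names; the OpsDB script covered below runs the same way):

sqlcmd -S SQLSERVER\INSTANCE -d OperationsManagerDW -E -i "UR_Datawarehouse.sql"
sqlcmd -S SQLSERVER\INSTANCE -d OperationsManager -E -i "update_rollup_mom_db.sql"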


 

Click the “Execute” button in SQL mgmt. studio.  The execution could take a considerable amount of time and you might see a spike in processor utilization on your SQL database server during this operation.

You will see the following (or similar) output:   “Command(s) completed successfully”

 

 


Next – let’s run the script to update the OperationsManager (Operations) database.  Open a SQL management studio query window, connect it to your Operations Manager database, and then open the script file (update_rollup_mom_db.sql).  Make sure it is pointing to your OperationsManager database, then execute the script.

You should run this script with each UR, even if you ran this on a previous UR.  The script body can change so as a best practice always re-run this.


 

Click the “Execute” button in SQL mgmt. studio.  The execution could take a considerable amount of time and you might see a spike in processor utilization on your SQL database server during this operation.  

I have had customers state this takes from a few minutes to as long as an hour. In MOST cases – you will need to shut down the SDK, Config, and Monitoring Agent (healthservice) on ALL your management servers in order for this to be able to run with success.

You will see the following (or similar) output:  “Command(s) completed successfully”, or a series of “(n row(s) affected)” messages.

 

IF YOU GET AN ERROR – STOP!  Do not continue.  Try re-running the script several times until it completes without errors.  In a production environment with lots of activity, you will almost certainly have to shut down the services (sdk, config, and healthservice) on your management servers, to break their connection to the databases, to get a successful run.
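
A sketch for bouncing those services from PowerShell on each management server (OMSDK, cshost, and HealthService are the service names behind the Data Access, Configuration, and Microsoft Monitoring Agent services on a default install):

# Stop the SDK, Config, and agent services to release their database connections
Stop-Service -Name OMSDK,cshost,HealthService
# ...run the SQL script, then start them back up
Start-Service -Name OMSDK,cshost,HealthService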

Technical tidbit:   Even if you previously ran this script in any previous UR deployment, you should run this again in this update, as the script body can change with updated UR’s.


 
 
3. Manually import the management packs


 

There are 58 management packs in this update!   Most of these we don’t need – so read carefully.

The path for these is on your management server, after you have installed the “Server” update:

\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\Management Packs for Update Rollups

However, the majority of them are Advisor/OMS, and language specific.  Only import the ones you need, and that are correct for your language.  I will remove all the MP’s for other languages (keeping only ENU), and I am left with the following:


 

What NOT to import:

The Advisor MP’s are only needed if you are connecting your on-prem SCOM environment to the Microsoft Operations Management Suite cloud service (OMS) (previously known as Advisor, and Operations Insights).

The APM MP’s are only needed if you are using the APM feature in SCOM.

The Alert Attachment and TFS MP bundle is only used for specific scenarios, such as DevOps scenarios where you have integrated APM with TFS, etc.  If you are not currently using these MP’s, there is no need to import or update them.  I’d skip this MP import unless you already have these MP’s present in your environment.

However, the Image and Visualization libraries deal with Dashboard updates, and these always need to be updated.

I import all of these shown without issue.

 

 


4.  Update Agents


Agents should be placed into pending actions by this update for any agent that was not manually installed (remotely manageable = yes):  

On the Management servers where I used Windows Update to patch them, their agents did not show up in this list.  Only agents whose management server I patched manually showed up in this list.  FYI – the experience is NOT the same when using Windows Update vs manual.  If yours don’t show up, you can try running the update for that management server again, manually.

 


 

If your agents are not placed into pending management – this is generally caused by not running the update from an elevated command prompt, or having manually installed agents which will not be placed into pending.

In this case – my agents that were reporting to a management server updated via Windows Update did NOT get placed into pending.  Only the agents reporting to the management server where I manually executed the patch showed up.

I re-ran the server MSP file manually on these management servers, from an elevated command prompt, and they all showed up.

You can approve these – which will result in a success message once complete:


 

Soon you should start to see PatchList getting filled in from the Agents By Version view under Operations Manager monitoring folder in the console:



 

I recommend you consider the following MP, which extends the Agents by Version view so you can see the agent version *number* under Agent Managed in Administration:

https://blogs.technet.microsoft.com/kevinholman/2017/02/26/scom-agent-version-addendum-management-pack/

 
 
 
5.  Update Unix/Linux MPs and Agents


The current Linux MP’s can be downloaded from:

https://www.microsoft.com/en-us/download/details.aspx?id=29696

7.5.1070.0 is the SCOM 2012 R2 UR12 release version.  

****Note – take GREAT care when downloading – that you select the correct download for SCOM 2012 R2.  You must scroll down in the list and select the MSI for 2012 R2:


 

Download the MSI and run it.  It will extract the MP’s to C:\Program Files (x86)\System Center Management Packs\System Center 2012 R2 Management Packs for Unix and Linux\

Update any MP’s you are already using.   These are mine for RHEL, SUSE, and the Universal Linux libraries. 


 

You will likely observe VERY high CPU utilization of your management servers and database server during and immediately following these MP imports.  Give it plenty of time to complete the process of the import and MPB deployments.

Next – you need to restart the “Microsoft Monitoring Agent” service on any management servers which manage Linux systems.  I don’t know why – but my MP’s never drop/update the UNIX/Linux agent files in the \Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\AgentManagement\UnixAgents\DownloadedKits folder until this service is restarted.
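
The restart itself is a one-liner from PowerShell (the “Microsoft Monitoring Agent” display name corresponds to the HealthService service):

Restart-Service -Name HealthService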

 

Next up – you would upgrade your agents on the Unix/Linux monitored agents.  You can now do this straight from the console:


You can input credentials or use existing RunAs accounts if those have enough rights to perform this action.



 
 
6.  Update the remaining deployed consoles


 

This is an important step.  I have consoles deployed around my infrastructure – on my Orchestrator server, SCVMM server, on my personal workstation, on all the other SCOM admins on my team, on a Terminal Server we use as a tools machine, etc.  These should all get the matching update version.

You can use Help > About to bring up a dialog box and check your console version.



 
 
 
Review:


Now at this point, we would check the OpsMgr event logs on our management servers, check for any new or strange alerts coming in, and ensure that there are no issues after the update.


Known issues:

See the existing list of known issues documented in the KB article.

1.  Many people are reporting that the SQL script is failing to complete when executed.  You should attempt to run this multiple times until it completes without error.  You might need to stop the Exchange correlation engine, stop all the SCOM services on the management servers, and/or bounce the SQL server services in order to get a successful completion in a busy management group.  The errors reported appear as below:

——————————————————
(1 row(s) affected)
(1 row(s) affected)
Msg 1205, Level 13, State 56, Line 1
Transaction (Process ID 152) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
Msg 3727, Level 16, State 0, Line 1
Could not drop constraint. See previous errors.
——————————————————–

UR3 for SCOM 2016 – Step by Step


 


 

KB Article for OpsMgr:  https://support.microsoft.com/en-us/help/4016126/update-rollup-3-for-system-center-2016-operations-manager

Download catalog site:  http://www.catalog.update.microsoft.com/Search.aspx?q=4016126

Updated UNIX/Linux Management Packs:  https://www.microsoft.com/en-us/download/details.aspx?id=29696

Recommended hotfix page:  https://blogs.technet.microsoft.com/kevinholman/2009/01/27/which-hotfixes-should-i-apply/

 

NOTE:  I get this question every time we release an update rollup:   ALL SCOM Update Rollups are CUMULATIVE.  This means you do not need to apply them in order, you can always just apply the latest update.  If you have deployed SCOM 2016 and never applied an update rollup – you can go straight to the latest one available. 

 
 
Key fixes:
  • The Application Performance Monitoring (APM) feature in System Center 2016 Operations Manager Agent causes a crash for the IIS Application Pool that’s running under the .NET Framework 2.0 runtime. Microsoft Monitoring Agent should be updated on all servers that use .NET 2.0 application pools for APM binaries update to take effect. Restart of the server might be required if APM libraries were being used at the time of the update.
  • Organizational Unit (OU) properties for Active Directory systems are not being discovered or populated.
  • The PatchLevel discovery script was fixed to properly discover patch level.
  • SQL Agent jobs for maintenance schedule use the default database. If the database name is not the default, the job fails.
  • When the heartbeat failure monitor is triggered, a “Computer Not Reachable” message is displayed even when the computer is not down.
  • An execution policy has been added as unrestricted to PowerShell scripts in Inbox management packs.
  • The Microsoft.SystemCenter.Agent.RestartHealthService.HealthServicePerfCounterThreshold recovery task fails to restart the agent, and you receive the following error message:  (LaunchRestartHealthService.ps1 cannot be loaded because the execution of scripts is disabled on this system.)   This issue has been resolved to make the recovery task work whenever the agent is consuming too much resources.
  • The Get-SCOMOverrideResult PowerShell cmdlet doesn’t return the correct list of effective overrides.
  • The Event ID: 26373 event, which happens when there are large amounts of rows returned from an SDK query, has been changed from a “Critical” event to an “Informational” event (because there is nothing you can do about it).
  • When you run System Center 2016 Operations Manager in an all-French locale (FRA) environment, the Date column in the Custom Event report appears blank.
  • The Enable deep monitoring using HTTP task in the System Center Operations Manager console doesn’t enable WebSphere deep monitoring on Linux systems.
  • When overriding multiple properties on rules that are created by the Azure Management Pack, duplicate override names are created. This causes overrides to be lost.
  • When creating a management pack (MP) on a client that contains a Service Level (SLA) dashboard and Service Level Objects (SLO), the localized names of objects aren’t displayed properly if the client’s CurrentCulture settings don’t match the CurrentUICulture settings. In cases where the localized settings are English English, ENG, or Australian English, ENA, there’s an issue when the objects are renamed.
  • The UseMIAPI registry subkey prevents collection of processor performance data for Red Hat Linux systems.  Custom performance collection rules are also impacted by the UseMIAPI setting.
  • This update adds support for OpenSSL1.0.x on AIX computers. With this change, System Center Operations Manager uses OpenSSL 1.0.x as the default minimum version supported on AIX, and OpenSSL 0.9.x is no longer supported.

 

 


    Let’s get started.

     

    From reading the KB article – the order of operations is:

    1. Install the update rollup package on the following server infrastructure:
      • Management server or servers
      • Web console server role computers
      • Gateway
      • Operations console role computers
    2. Apply SQL scripts.
    3. Manually import the management packs.
    4. Apply Agent Updates
    5. Update Nano Agents
    6. Update Unix/Linux MP’s and Agents

     

     

    1.  Management Servers


    It doesn’t matter which management server I start with.  I simply make sure I only patch one management server at a time to allow for agent failover without overloading any single management server.

    I can apply this update manually via the MSP files, or I can use Windows Update.  I have 2 management servers, so I will demonstrate both.  I will do the first management server manually.  This management server holds 3 roles, and each must be patched:  Management Server, Web Console, and Console.

    The first thing I do when I download the updates from the catalog, is copy the cab files for my language to a single location, and then extract the contents:


     

     

    Once I have the MSP files, I am ready to start applying the update to each server by role.

    ***Note:  You MUST log on to each server role as a Local Administrator, SCOM Admin, AND your account must also have System Administrator role to the SQL database instances that host your OpsMgr databases.

     

    My first server is a management server, and the web console, and has the OpsMgr console installed, so I copy those update files locally, and execute them per the KB, from an elevated command prompt:


     

    This launches a quick UI which applies the update.  It will bounce the SCOM services as well.  The update usually does not provide any feedback that it had success or failure….  but I did get a reboot prompt.  You can choose “No” and then reboot after applying all the SCOM role updates.


     

    You can check the application log for the MsiInstaller events to show completion:

    Log Name:      Application
    Source:        MsiInstaller
    Event ID:      1036
    Computer:      SCOM1.opsmgr.net
    Description:  Windows Installer installed an update. Product Name: System Center Operations Manager 2016 Server. Product Version: 7.2.11719.0. Product Language: 1033. Manufacturer: Microsoft Corporation. Update Name: System Center 2016 Operations Manager Update Rollup 3 Patch. Installation success or error status: 0.
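
    You can also pull that event with PowerShell (a quick sketch):

    Get-WinEvent -FilterHashtable @{ LogName='Application'; ProviderName='MsiInstaller'; Id=1036 } | Select-Object TimeCreated, Message -First 5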

     

    You can also spot check a couple DLL files for the file version attribute. 


     

    Next up – run the Web Console update:


     

    This runs much faster.   A quick file spot check:


     

    Lastly – install the console update (make sure your console is closed):


    A quick file spot check:


     

    Or check Help > About in the console.


     

     

    Additional Management Servers:


    Windows Update contains the UR3 patches for SCOM 2016.   For my second Management Server – I will demonstrate that:


     

     

    Updating Gateways:


    Generally I can use Windows Update or manual installation.  I will proceed with manual:


     

    The update launches a UI and quickly finishes.

    Then I will spot check the DLL’s:


     

    I can also spot-check the \AgentManagement folder, and make sure my agent update files are dropped here correctly:


    ***NOTE:  You can delete any older UR update files from the \AgentManagement directories.  The UR’s do not clean these up and they provide no purpose for being present any longer.

     

    I could also apply the GW update via Windows Update:


     
     
     
     
     
    2. Apply the SQL Scripts


    In the path on your management servers, where you installed/extracted the update, there is ONE SQL script file: 

    %SystemDrive%\Program Files\Microsoft System Center 2016\Operations Manager\Server\SQL Script for Update Rollups

    (note – your path may vary slightly depending on if you have an upgraded environment or clean install)

    Next – let’s run the script to update the OperationsManager (Operations) database.  Open a SQL management studio query window, connect it to your Operations Manager database, and then open the script file (update_rollup_mom_db.sql).  Make sure it is pointing to your OperationsManager database, then execute the script.

    You should run this script with each UR, even if you ran this on a previous UR.  The script body can change so as a best practice always re-run this.


     

    Click the “Execute” button in SQL mgmt. studio.  The execution could take a considerable amount of time and you might see a spike in processor utilization on your SQL database server during this operation.  

    I have had customers state this takes from a few minutes to as long as an hour. In MOST cases – you will need to shut down the SDK, Config, and Monitoring Agent (healthservice) on ALL your management servers in order for this to be able to run with success.

    You will see the following (or similar) output: 


     

    IF YOU GET AN ERROR – STOP!  Do not continue.  Try re-running the script several times until it completes without errors.  In a production environment with lots of activity, you will almost certainly have to shut down the services (sdk, config, and healthservice) on your management servers, to break their connection to the databases, to get a successful run.

    Technical tidbit:   Even if you previously ran this script in any previous UR deployment, you should run this again in this update, as the script body can change with updated UR’s.

     

     

     
    3. Manually import the management packs


    There are 33 management packs in this update!   Most of these we don’t need – so read carefully.

    The path for these is on your management server, after you have installed the “Server” update:

    \Program Files\Microsoft System Center 2016\Operations Manager\Server\Management Packs for Update Rollups

    However, the majority of them are Advisor/OMS, and language specific.  Only import the ones you need, and that are correct for your language.  


     

    What NOT to import:

    The Advisor MP’s are only needed if you are using Microsoft Operations Management Suite cloud service, (Previously known as Advisor, and Operations Insights).

    DON’T import ALL the languages – ONLY ENU, or any other languages you might require.

    The Alert Attachment MP update is only needed if you are already using that MP for very specific other MP’s that depend on it (rare)

    The IntelliTrace Profiling MP requires IIS MP’s and is only used if you want this feature in conjunction with APM.

     

    So I remove what I don’t want or need, and import the remaining MP’s.

     

    These import without issue.  If the “Install” button is greyed out – this means you have an MP in your import list that is already imported and not updated.  The “Microsoft System Center Advisor Resources (ENU)” MP was causing this for me – since it hasn’t been updated, I simply remove it from the list so I can install.

     

     
     
    4.  Update Agents


    Agents should be placed into pending actions by this update for any agent that was not manually installed (remotely manageable = yes):  


     

    If your agents are not placed into pending management – this is generally caused by not running the update from an elevated command prompt, or having manually installed agents which will not be placed into pending by design, OR if you use Windows Update to apply the update rollup for the Server role patch.

    You can approve these – which will result in a success message once complete:


     

    You can verify the PatchLevel by going into the console and opening the view at:  Monitoring > Operations Manager > Agent Details > Agents by Version
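
    You can also check this from PowerShell (a sketch – PatchList is a property on the agent objects returned by Get-SCOMAgent):

    Get-SCOMAgent | Sort-Object Version | Select-Object DisplayName, Version, PatchList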


     

    I also recommend you take a look at this community MP, which helps you see the “REAL” agent version number in the “Agent Managed” view in the console:

    https://blogs.technet.microsoft.com/kevinholman/2017/02/26/scom-agent-version-addendum-management-pack/

     

     
    5.  Update UNIX/Linux MPs and Agents


     

    The UNIX/Linux MP’s and agents have been updated to align with UR3 for SCOM 2016.  You can get them here:

    https://www.microsoft.com/en-us/download/details.aspx?id=29696

    The current version of these MP’s for SCOM 2016 UR3 is 7.6.1076.0 – and includes agents with version 1.6.2-339

     

    Make sure you download the correct version for your SCOM deployment:


     

    Download, extract, and import ONLY the updated Linux/UNIX MP’s that are relevant to the OS versions that you want to monitor:


     

    This will take a considerable amount of time to import, and consume a lot of CPU on the management servers and SQL server until complete.

    Once it has completed, you will need to restart the Healthservice (Microsoft Monitoring Agent) on each management server, in order to get them to update their agent files at \Program Files\Microsoft System Center 2016\Operations Manager\Server\AgentManagement\UnixAgents

     

    You should see the new files dropped with new timestamps:


     

    Now you can deploy the agent updates:


     

    Next – you decide if you want to input credentials for the SSH connection and upgrade, or if you have existing RunAs accounts that are set up to do the job (Agent Maintenance/SSH Account)


     

    If you have any issues, make sure your SUDOERS file has the correct information pertaining to agent upgrade:

    https://blogs.technet.microsoft.com/kevinholman/2016/11/11/monitoring-unixlinux-with-opsmgr-2016/

     
     
     
     
    6.  Update the remaining deployed consoles


    This is an important step.  I have consoles deployed around my infrastructure – on my Orchestrator server, SCVMM server, on my personal workstation, on all the other SCOM admins on my team, on a Terminal Server we use as a tools machine, etc.  These should all get the matching update version.

     

     

     

     

    Review:


    Now at this point, we would check the OpsMgr event logs on our management servers, check for any new or strange alerts coming in, and ensure that there are no issues after the update.

     

    Known Issues:

    None!

    Installing SQL 2016 Always On with Windows Server 2016 Core


     

    This will be a simple walk through of installing two Windows Server 2016 Core servers, then installing SQL 2016, and setting up SQL Always On replication between them.  This is meant for lab testing and getting familiar with the scenario.  This setup is incredibly simple and straightforward, and fast.  You can have this scenario up and running in just a few minutes.

     

    First, deploy two VM’s.  Nothing fancy (2GB RAM, 2 vCPU’s, 1 disk) is fine for a lab deployment.

    I will name mine:  SQLCORE1 and SQLCORE2.

    Install Windows Server 2016, and choose the default option of Windows Server Core (no GUI):


     

    When the install is complete, log in by creating a password.  You are now ready to begin configuration.

    From the command line, run PowerShell.

    We will configure static IP’s and DNS on each server.  Change these to match your lab:

    New-NetIPAddress -InterfaceAlias "Ethernet" -IPAddress 10.10.10.60 -PrefixLength 24 -DefaultGateway 10.10.10.1
    Set-DnsClientServerAddress -InterfaceAlias "Ethernet" -ServerAddresses 10.10.10.10,10.10.10.11

     

    Next – we will join the domain and rename the computer when prompted.  Type “sconfig” and press enter.


     

    From the menu – choose “1”.  Choose Domain, and provide your domain and domain credentials to be able to join.

    When prompted, choose “Yes” to change the computer name.  Provide the new computername you want for your SQL core servers.  Mine are SQLCORE1 and SQLCORE2.

    Reboot when prompted.

    You must log in as the local administrator after the reboot.  Then, type “logoff” and hit enter.  Now you can log in as your domain admin account in the domain.  Hit ESC to get back to “other user” and log in as a domain account.

    Add the domain group for your SQL admin’s to the local administrators group at the command prompt:

    net localgroup administrators /add OPSMGR\SQLAdmins

     

    At this point you can log in as one of your SQL Administrator accounts, or continue the installation as your domain admin account.

    Map a drive to your SQL 2016 installation media:

    Net use Y: \\server\software\sql\2016\ENT

     

    Install SQL server from the command line.  There are two ways to install SQL.  From a command line with options, or from an INI file.  The INI file is much more powerful, but to keep things simple we will use a command line here.  This basic install will cover the SQL database engine, the Full-text service, and set the SQL agent service to run as automatic startup.  You will need to change your domain group for the SQL admins, and your SQL service account and password.

    Setup.exe /qs /ACTION=Install /FEATURES=SQLEngine,FullText /INSTANCENAME=MSSQLSERVER /SQLSVCACCOUNT="OPSMGR\sqlsvc" /SQLSVCPASSWORD="password" /SQLSYSADMINACCOUNTS="OPSMGR\sqladmins" /AGTSVCACCOUNT="OPSMGR\sqlsvc" /AGTSVCPASSWORD="password" /AGTSVCSTARTUPTYPE=Automatic /TCPENABLED=1 /IACCEPTSQLSERVERLICENSETERMS

     

    The SQL setup will begin and you will see some UI’s pop up along with progress in the command line window….  when complete you will be returned to a command prompt.

    Now that SQL is installed – reboot each server.

    Log back in with a domain account to continue setup and configuration.

    Next, we will configure the firewall.  We will open the necessary ports for SQL and Always On, and then enable the built in group rules for remote administration.

    Run PowerShell.

    Copy and paste the following to configure the firewall:

New-NetFirewallRule -Group "Custom SQL" -DisplayName "SQL Default Instance" -Direction Inbound -Protocol TCP -LocalPort 1433 -Action Allow
New-NetFirewallRule -Group "Custom SQL" -DisplayName "SQL Admin Connection" -Direction Inbound -Protocol TCP -LocalPort 1434 -Action Allow
New-NetFirewallRule -Group "Custom SQL" -DisplayName "SQL Always On VNN" -Direction Inbound -Protocol TCP -LocalPort 1764 -Action Allow
New-NetFirewallRule -Group "Custom SQL" -DisplayName "SQL Always On AG Endpoint" -Direction Inbound -Protocol TCP -LocalPort 5022 -Action Allow
Enable-NetFirewallRule -DisplayGroup "Remote Desktop"
Enable-NetFirewallRule -DisplayGroup "Remote Event Log Management"
Enable-NetFirewallRule -DisplayGroup "Remote Service Management"
Enable-NetFirewallRule -DisplayGroup "File and Printer Sharing"
Enable-NetFirewallRule -DisplayGroup "Performance Logs and Alerts"
Enable-NetFirewallRule -DisplayGroup "Remote Volume Management"
Enable-NetFirewallRule -DisplayGroup "Windows Firewall Remote Management"

     

    Next, we will install the Windows Failover Cluster feature, a prerequisite for SQL Always On.

Install-WindowsFeature Failover-Clustering -IncludeManagementTools

    image

     

    Next – we will create a cluster.  You can create a simple failover cluster between two nodes in a single line of PowerShell!  You will need to change your cluster name, IP address, and node names to match your configuration.  Only run this on ONE NODE! 

    (This step assumes you are running this as a domain admin, as this will create a computer account in the domain for the virtual cluster computer.  If you do not wish to run this as a domain admin, you must pre-stage that account and assign permissions.  See cluster documentation for this)

New-Cluster -Name SQLCORECL1 -StaticAddress 10.10.10.62 -Node SQLCORE1,SQLCORE2

    You might see a warning at this point.  That’s fine – they are likely just because we have a single NIC in each VM, and because we didn’t configure a witness.
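
If you want to quiet those warnings in a lab, a quick optional sketch – validate the nodes and add a file share witness (the share path is an example):

# Optional: run cluster validation and configure a file share witness
Test-Cluster -Node SQLCORE1,SQLCORE2
Set-ClusterQuorum -FileShareWitness "\\server\software\witness"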

    image

     

    Next, we need to enable each server to support SQL Always On.  You will need to provide your SERVERNAME\INSTANCENAME.  If you use the default instance like we did above, input just the servername.  Do this on each node, but change the servername to match the correct node name you are running it on.

# This should be in the format SERVERNAME\INSTANCENAME, or just the server name for a default instance
$ServerInstance = 'SQLCORE1'
Enable-SqlAlwaysOn -ServerInstance $ServerInstance -Force
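
If you would rather enable both nodes from a single session, a minimal sketch (assumes the SQL PowerShell module is available and both are default instances):

# Enable Always On on both nodes remotely; -Force restarts each SQL service without prompting
'SQLCORE1','SQLCORE2' | ForEach-Object { Enable-SqlAlwaysOn -ServerInstance $_ -Force }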

     

    Lastly – we need to configure the Always On availability group.  This is easiest done manually, via SQL management studio from a remote tools machine.

    Launch SQL management studio and connect to the SQLCORE1 server:

    image

     

First, we need to create a “dummy” database ONLY on SQLCORE1, which is required to configure and test Always On.  Go to Databases, right-click, and choose New Database.  Name the database “TESTDB” and click OK.

    image

     

Before we can use a database in Always On, it must have at least one previous backup.  Right-click TESTDB, choose Tasks > Back Up.  Click OK to accept the defaults, and OK again when the backup is done.
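
If you prefer to script this step, a minimal sketch using the SqlServer PowerShell module from your tools machine (the database name and backup path are examples, and the folder must already exist on SQLCORE1):

# Create the seed database and take its required first full backup
Invoke-Sqlcmd -ServerInstance 'SQLCORE1' -Query 'CREATE DATABASE TESTDB'
Backup-SqlDatabase -ServerInstance 'SQLCORE1' -Database 'TESTDB' -BackupFile 'C:\Backup\TESTDB.bak'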

    Now expand “Always On High Availability”, and Right Click “Availability Groups” and choose “New Availability Group Wizard”

    image

     

    Assign an AG name.  This isn’t terribly important.  I will use “SQLCOREAG1” and click Next.

    image

     

    Select your TestDB and click Next.

    image

     

    Add a replica, and choose your other server, SQLCORE2.  Check the boxes next to Auto failover and Synchronous commit on both servers.

    image

     

On the Listener tab, create an Availability Group listener.  I will use “SQLCOREAGL1”.  We will use port 1764 (which we created a firewall rule for earlier).  You will need to scroll down to the bottom right and click “Add” to add in an IP address, then click Next.

    (This step assumes you are running this as a domain admin, as this will create a computer account in the domain for the virtual availability group listener.  If you do not wish to run this as a domain admin, you must pre-stage that account and assign permissions.  See SQL Always On documentation for this)

    image

     

Next, you will choose FULL synchronization, and provide a network share that both servers have read and write access to.

    image

     

    This will run the tests:

    image

     

    Click “Finish” and you should have success!

    image

     

Go into SQL Management Studio and look over your configuration:

    image


    Stop Healthservice restarts in SCOM 2016


     

    image

     

    This is probably the single biggest issue I find in 100% of customer environments.

    YOU ARE IMPACTED.  Trust me.

     

    SCOM monitors itself to ensure we aren’t using too much memory, or too many handles for the SCOM processes.  If we detect that the SCOM agent is using an unexpected amount of memory or handles, we will forcibly KILL the agent, and restart it.

    That sounds good right?

In theory, yes.  In reality, however, this is KILLING your SCOM environment, and you probably aren’t even aware it is happening.

     

    The problem?

1.  The default thresholds are WAY out of touch with reality.  They were set almost 10 years ago, when systems used a LOT fewer resources than modern operating systems do today.  This is MUCH worse if you choose to MULTIHOME.  Multi-homed agents can use twice as many resources as non-multi-homed agents, and this restart can be issued from EITHER management group, but will affect BOTH.

    2.  We don’t generate an alert when this happens, so you are blind that this is impacting you.

     

    We need to change these in the product.  Until we do, a simple override is the solution.

     

    Why is this so bad?

    This is bad because of two impacts:

1.  You are hurting your monitored systems by restarting the agent over and over, causing the startup scripts to run in loops and actually consume additional resources.  You are also going without any monitoring for periods of time, because while the agent is killed and restarting, the monitoring is unloaded.

2.  You are filling SCOM with state change events.  Every time all the monitors initialize, they send an updated “new” state change event upon initialization.  You are hammering SCOM with useless state data.

     

    What can I do about it?

    Well, I am glad you asked!  We simply need to override 4 monitors, to give them realistic agent thresholds, and set them to generate an informational alert.  I will also include a view for these alerts so we can see if anyone is still generating them.  I will wrap all this in a sample management pack for you to download.

     

    In the console, go to Authoring, Monitors, and change scope to “Agent”

    image

     

    We will override each one:

Private Bytes monitors should be set to a default threshold of 943718400 bytes (900 MB, triple the default of 300 MB)

Handle Count monitors should be set to 30000  (the default of 6000 is WAY too low)

    Override Generate Alert to True (to generate alerts)

Override Auto-Resolve to False (even though the default is false, this must be set to keep these from auto-closing, so you can see them and their repeat count)

    Override Alert severity to Information (to keep from ticketing on these events)
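
To see the four monitors you are about to override, here is a quick sketch from the Operations Manager Shell – the display name match is my assumption, so adjust it if your MP version names them differently:

# List the agent private bytes and handle count monitors and whether they are enabled
$AgentClass = Get-SCOMClass -Name 'Microsoft.SystemCenter.Agent'
Get-SCOMMonitor -Target $AgentClass | Where-Object { $_.DisplayName -match 'Private Bytes|Handle Count' } | Select-Object DisplayName, Enabled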

     

     

    Override EACH monitor, “all objects of class” and choose “Agent” class.

    image

     

    NOTE: It is CRITICAL that we choose the “Agent” class for our overrides, because we do not want to impact thresholds already set on Management Servers or Gateways.

     

    This is a good configuration:

    image

    image

    image

    image

     

    Ok – those are much more reasonable defaults.

     

    What else should I do?

    Create an alert view that shows alerts with name “Microsoft.SystemCenter.Agent.%”

This will show you if you STILL have some agents restarting on a regular basis.  You should review the ones with high repeat counts on a weekly basis, and either adjust their agent-specific thresholds or investigate why they are consuming so much, so often.  An occasional agent restart (one or fewer per day) is totally fine and probably not worth the time to investigate.

     

    image
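
You can pull the same picture from the Operations Manager Shell – a minimal sketch, assuming the alert names start with the monitor names as above:

# Show the noisiest agents by restart alert repeat count
Get-SCOMAlert -Name 'Microsoft.SystemCenter.Agent*' | Sort-Object RepeatCount -Descending | Select-Object -First 20 NetbiosComputerName, Name, RepeatCount, TimeRaised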

     

I am including a management pack with these overrides and the alert view, which you can download below if you prefer not to make your own.

     

    Download:

    https://gallery.technet.microsoft.com/SCOM-Agent-Threshold-b96c4d6a

    Don’t forget to License your SCOM 2016 deployments


     

    image

     

Just like previous versions of Operations Manager, all SCOM deployments are installed as “Evaluation Version,” which is a 180-day trial.  You DON’T want to forget about this and have your production and lab deployments time-bomb on you down the road.

    To see your current license, in PowerShell on a SCOM server:

    Get-SCOMManagementGroup | ft skuforlicense, timeofexpiration -a

    image

     

    In order to set your license – you just need to run the Set-SCOMLicense cmdlet.  This is documented here:

    https://docs.microsoft.com/en-us/powershell/systemcenter/systemcenter2016/operationsmanager/vlatest/set-scomlicense

     

    Two things:

    1.  You need to get your license key, from whomever keeps that information for your company.

2.  You MUST run this cmdlet in a PowerShell session launched “As an administrator,” as it needs access to write to the registry.

     

    Run this command ONE time on ANY management server…..

    Set-SCOMLicense -ProductId ‘99999-99999-99999-99999-99999’

    …… where you change the example key above to your key.

    You should restart the PowerShell session, then run the command to get the license again.

    image

(Note:  You might have to restart your management server services or reboot the management server before you see this take effect)

    SCOM 2012 and 2016 Unsealed MP Backup


     

    image

This is a management pack that I use in every customer environment.  You *NEED* to back up your unsealed MP’s.  This will allow you to quickly recover from a mistake, without having to restore your databases from a backup.  Over the years, I have seen many customers accidentally delete workflows, mess up their RunAs accounts, break AD integration, or break their notifications.  All of these things are stored in unsealed MP’s.  We really need to back them up, with a daily history.  The amount of space needed is very small.

     

    This is an updated version of the community MP from SystemCenterCentral.com written by Neale Brown, Derek Harkin, Pete Zerger and Tommy Gunn, located at:  http://www.systemcentercentral.com/pack-catalog/backup-unsealed-management-packs-opsmgr-2012-edition/

     

It contains a single rule, which now targets the “All Management Servers Resource Pool” to give this workflow high availability.

    image

     

    The rule runs once per day (24 hours) and executes a PowerShell script.

    You can edit the Write Action configuration for the number of days, and the share location, or local directory:

    image

     

    This will create these directories if they do not exist, either a local path on the management server, or on a share you provide as above.

     

    image
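
If you ever need a one-off backup outside of the MP, the core idea is just a few lines of PowerShell in the Operations Manager Shell (the folder path is an example):

# Export all unsealed MPs to a dated folder - the same idea this rule's script automates daily
$Dir = "C:\MPBackup\$(Get-Date -Format 'yyyy-MM-dd')"
New-Item -ItemType Directory -Path $Dir -Force | Out-Null
Get-SCOMManagementPack | Where-Object { -not $_.Sealed } | Export-SCOMManagementPack -Path $Dir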

     

    It will log events to the SCOM event log for tracking:

     

    image

    image

     

This script will run on one of your SCOM management servers, and will execute as the SCOM Management Server Action Account by default.  If you want to specify a specific account, there is a RunAs profile included.  You will need to use an account that has SCOM admin rights to the SDK, and read/write access to the directory or share that you choose.

    image

     

     

    Changes I made:

• Supports multiple management groups exporting to the same share path
• Adds start and completion logging, with runtime and whoami
• Makes the SCOM management group connection more reliable, with debug logging

     

    You can download the MP here:

    https://gallery.technet.microsoft.com/SCOM-2012-and-2016-2ccc45c0

    Document your SCOM RunAs Account and Profiles script


     

    This script will document your SCOM RunAs accounts, and any profiles they are associated to.  It will output as a CSV file. 

    This is handy for collecting data for change management, making sure multiple management groups have the same configuration, and ensuring you have documented accounts prior to a major upgrade.

The script is based on Dirk Brinkmann's 2012 script, located here:  https://gallery.technet.microsoft.com/Listing-SCOM-2012-R2-24be56b1

     

    Here is a sample of the output:

     

    image
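
As a taste of what the script collects, a minimal sketch of the account half is below – the full script also maps each account to its profile associations, which takes more SDK work (the output path is an example):

# Quick inventory of RunAs accounts to CSV
Get-SCOMRunAsAccount | Select-Object Name, Description | Export-Csv -Path 'C:\Temp\RunAsAccounts.csv' -NoTypeInformation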

     

    Download here:   

    https://gallery.technet.microsoft.com/Document-SCOM-RunAs-cb64d461

    What SQL maintenance should I perform on my SCOM 2016 databases?


     

    image

     

    ***Note – The products and recommendations have changed over the years, so what applied to previous versions does not really apply today.  Make sure you read the entire article!

     

    The SQL instances and databases deployed to support SCOM, generally fall into one of two categories: 

1.  The SQL server is managed by a DBA team within the company, and that team's standards will be applied.

    2.  The SCOM team fully owns and supports the SQL servers.

     

    Most SQL DBA's will set up some pretty basic default maintenance on all SQL DB's they support.  This often includes, but is not limited to:

    • CHECKDB  (to look for DB errors and report on them)
    • UPDATE STATISTICS  (to boost query performance)
    • REINDEX  (to rebuild the table indexes to boost performance)
    • BACKUP

    SQL DBA's might schedule these to run via the SQL Agent to execute nightly, weekly, or some combination of the above depending on DB size and requirements.

     

On the other side of the coin.... in some companies, the SCOM team installs and owns the SQL server.... and they don't do ANY default maintenance to SQL.  Because of this all too common scenario - a focus in SCOM was to make the Ops DB and Data Warehouse DB somewhat self-maintaining.... providing a good level of SQL performance whether or not any default maintenance is being done.

     

    Operational Database:

    Daily jobs that run for the OpsDB:

    • 12:00 AM – Partitioning and Grooming
    • 2:00 AM – Discovery Data Grooming
    • 2:30 AM – Optimize Indexes
    • 4:00 AM – Alert auto-resolution

     

    Reindexing is already taking place against the OperationsManager database for some of the tables (but not all, and this is important to understand!).  This is built into the product.  What we need to ensure - is that any default DBA maintenance tasks are not conflicting with our built-in maintenance, and our built-in schedules:

    There is a rule in OpsMgr that is targeted at the All Management Servers Resource Pool:

    The rule executes the "p_OptimizeIndexes" stored procedure, every day at 2:30AM.

This rule cannot be changed or modified.  Therefore - we need to ensure there is no other SQL maintenance (including backups) running at 2:30AM, or performance could be impacted.

    If you want to view the built-in UPDATE STATISTICS and REINDEX jobs history - just run the following queries:

    SELECT TableName,
    OptimizationStartDateTime,
    OptimizationDurationSeconds,
    BeforeAvgFragmentationInPercent,
    AfterAvgFragmentationInPercent,
    OptimizationMethod
    FROM DomainTable dt
    inner join DomainTableIndexOptimizationHistory dti
    on dt.domaintablerowID = dti.domaintableindexrowID
    ORDER BY OptimizationStartDateTime DESC

    SELECT TableName,
    StatisticName,
    UpdateStartDateTime,
    UpdateDurationSeconds
    FROM DomainTable dt
    inner join DomainTableStatisticsUpdateHistory dti
    on dt.domaintablerowID = dti.domaintablerowID
    ORDER BY UpdateStartDateTime DESC

    Take note of the update/optimization duration seconds column.  This will show you how long your maintenance is typically running.  In a healthy environment these should not take very long.

In general - we would like the "Scan density" to be high (above 80%), and the "Logical Scan Fragmentation" to be low (below 30%).  What you might find... is that *some* of the tables are more fragmented than others, because our built-in maintenance does not reindex all tables - especially tables like the raw perf, event, and localizedtext tables.
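
If you want to check fragmentation yourself, here is a minimal sketch using the index physical stats DMV – run it against the OperationsManager database (the server name is an example, and the SqlServer PowerShell module is assumed):

# Report fragmentation per index, worst first, for the OperationsManager database
$Query = @"
SELECT OBJECT_NAME(ips.object_id) AS TableName, i.name AS IndexName, ips.avg_fragmentation_in_percent AS FragPercent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
JOIN sys.indexes i ON ips.object_id = i.object_id AND ips.index_id = i.index_id
WHERE i.name IS NOT NULL
ORDER BY ips.avg_fragmentation_in_percent DESC
"@
Invoke-Sqlcmd -ServerInstance 'SQLSERVER1' -Database 'OperationsManager' -Query $Query | Format-Table -AutoSize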

     

    This brings us to the new perspectives in SCOM 2016, especially when used with SQL 2016.

     

In SQL 2016, some changes were made to optimize performance, especially when using new storage subsystems that leverage disks like SSDs.  The net effect of these changes on SCOM databases is that they will consume much more space than on SQL 2014 and earlier.  The reason for this is deeply technical, and I will cover it later.  But what you need to understand as a SCOM owner is that sizing guidance based on previous versions of SQL will not match what you see with SQL 2016.  This isn't a bad thing, you just need to make some minor changes to counteract it.

    SCOM inserts performance and event data into the SCOM database via something called BULK INSERT.  When we bulk insert the data, SCOM is designed to use a fairly small batch size by default.  In SQL 2016, this creates lots of unused reserved space in the database, that does not get reused.  If you review a large table query – you will observe this as “unused” space.

    image

    Note in the above graphic – the unused space is over 5 TIMES the space used by actual data!

    image

    If you want to read more about this – my colleague Dirk Brinkmann worked on discovering the root cause of this issue, and has a great deep dive on this:

    https://blogs.technet.microsoft.com/germanageability/2017/07/07/possible-increased-unused-disk-space-when-running-scom-2016-on-sql2016/

    The SQL server team also recently added a blog post describing the issue in depth:

    https://blogs.msdn.microsoft.com/sql_server_team/sql-server-2016-minimal-logging-and-impact-of-the-batchsize-in-bulk-load-operations/

     

     

Do not despair.  In order to clean up the unused space, a simple Index Rebuild, or at a minimum an Index Reorganize, for each table is all that is needed.  HOWEVER – these perf tables are NOT covered by the built-in index maintenance!  This was likely a design decision, because these are not static tables; they contain transient data in the OpsDB that is only held for a short amount of time.  The long term data is moved into the Data Warehouse DB, where it is aggregated into hourly and daily tables – and those are indexed via built-in maintenance.

To resolve this, and likely improve performance of SCOM – I recommend that each SCOM customer set up a SQL Agent job that handles index maintenance for the entire OpsDB, once a day.  Given the other schedules, a start time between 3:00 AM and 6:00 AM would likely be a good time for this maintenance.  That lets the built-in maintenance run first, and won't conflict with too much.  You should try to avoid having anything running at 4:00 AM because of the Alert auto-resolution – we don't want any blocking going on for that activity.

There are other performance benefits to reindexing the entire database as well, since many new visualization tables have been added over time, and these don't get hit by our built-in maintenance.

     

    A great set of maintenance TSQL scripts for Agent Jobs plan can be found at https://ola.hallengren.com/

    Specifically the index maintenance plan at https://ola.hallengren.com/sql-server-index-and-statistics-maintenance.html

This is a well-thought-out maintenance solution that analyzes the tables, and chooses to rebuild or reorganize based on fragmentation thresholds, skipping tables that don't need it at all.  The first time you index the entire DB, it may take a long time.  Once you set this up to run daily, it will mostly be optimizing the current daily perf and event tables, which each contain one day's worth of data.
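
As a rough sketch of what the nightly Agent job step might call – the parameter names come from Ola's documentation, the thresholds are examples to tune, and this assumes his scripts are installed in master:

# Invoke IndexOptimize against the OpsDB (schedule this as a SQL Agent job in practice)
$Tsql = @"
EXECUTE dbo.IndexOptimize
@Databases = 'OperationsManager',
@FragmentationLevel1 = 10,
@FragmentationLevel2 = 30,
@FragmentationMedium = 'INDEX_REORGANIZE,INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
@FragmentationHigh = 'INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
@UpdateStatistics = 'ALL'
"@
Invoke-Sqlcmd -ServerInstance 'SQLSERVER1' -Database 'master' -Query $Tsql -QueryTimeout 3600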

    After a reindex – I have freed up a TON of space.  Here is the same DB:

    image

    Notice the huge decrease in “unused space”.  Additionally, the total space reserved in my perf tables is now consuming less than one fifth the amount of space in the database it was consuming previously.  This leaves you with a smaller footprint, and better performance.  I strongly recommend you set this up or check with your DBA’s to ensure it is happening.

     

     

     

    Data Warehouse Database:

The data warehouse DB is also self-maintaining.  This is called out by a rule, "Standard Data Warehouse Data Set maintenance rule," which is targeted to the "Standard Data Set" object type.  This stored procedure is called on the data warehouse every 60 seconds.  It performs many, many tasks, of which index optimization is but one.

    image

    This SP calls the StandardDatasetOptimize stored procedure, which handles any index operations.

    To examine the index and statistics history - run the following query for the Alert, Event, Perf, and State tables:

    select basetablename,
    optimizationstartdatetime,
    optimizationdurationseconds,
    beforeavgfragmentationinpercent,
    afteravgfragmentationinpercent,
    optimizationmethod,
    onlinerebuildlastperformeddatetime
    from StandardDatasetOptimizationHistory sdoh
    inner join StandardDatasetAggregationStorageIndex sdasi
    on sdoh.StandardDatasetAggregationStorageIndexRowId = sdasi.StandardDatasetAggregationStorageIndexRowId
    inner join StandardDatasetAggregationStorage sdas
    on sdasi.StandardDatasetAggregationStorageRowId = sdas.StandardDatasetAggregationStorageRowId
    ORDER BY OptimizationStartDateTime DESC

    In the data warehouse - we can see that all the necessary tables are being updated and reindexed as needed.  When a table is 10% fragmented - we reorganize.  When it is 30% or more, we rebuild the index.

    Since we run our maintenance every 60 seconds, and only execute maintenance when necessary, there is no "set window" where we will run our maintenance jobs.  This means that if a DBA team also sets up a UPDATE STATISTICS or REINDEX job - it can conflict with our jobs and execute concurrently. 

I will caveat the above statement with findings from the field.  We have some new visualization tables and management type tables that do not get optimized, and this can lead to degraded performance.  An example of that is http://www.theneverendingjourney.com/scom-2012-poor-performance-executing-sdk-microsoft_systemcenter_visualization_library_getaggregatedperformanceseries/   They found that running Update Statistics every hour was beneficial to reducing the CPU consumption of the warehouse.  If you manage a very large SCOM environment, this might be worth investigating.  I have seen many support cases which were resolved by a manual run of Update Statistics.

For the above reasons, I would be careful with any maintenance jobs on the Data Warehouse DB beyond a CHECKDB and a good backup schedule.   UNLESS – you are going to analyze the data, determine which areas aren't getting index maintenance, or determine how out of date your statistics get.  Then ensure any custom maintenance won't conflict with the built-in maintenance.

     

     

Lastly - I'd like to discuss the recovery model of the SQL databases.  We default to "simple" for all our DB's.  This should be left alone.... unless you have *very* specific reasons to change it.  Some SQL teams automatically assume all databases should be set to the "full" recovery model.  This requires that they back up the transaction logs on a very regular basis, but gives the added advantage of restoring up to the time of the last t-log backup.  For OpsMgr, this is of very little value, as the data changing on an hourly basis is of little value compared to the complexity added by moving from simple to full.  Also, changing to full will mean that your transaction logs will only checkpoint once a t-log backup is performed.  What I have seen is that many companies aren't prepared for the amount of data written to these databases.... and their standard transaction log backups (often hourly) are not frequent enough (or tlogs BIG enough) to keep them from filling.  The only valid reason to change to FULL, in my opinion, is when you are using an advanced replication strategy, like SQL Always On or log shipping, which requires the full recovery model.  When in doubt - keep it simple.

    P.S....  The Operations Database needs 50% free space at all times.  This is for growth, and for re-index operations to be successful.  This is a general supportability recommendation, but the OpsDB will alert when this falls below 40%. 

    For the Data warehouse.... we do not require the same 50% free space.  This would be a tremendous requirement if we had a multiple-terabyte database!

Think of the data warehouse as having 2 stages: a "growth" stage (while it is adding data and not yet grooming much, because you haven't hit the default 400 days retention) and a "maturity" stage (agent count is steady, MP's are not changing, and grooming is happening because you are at 400 days retention).  During "growth" we need to watch and maintain free space, and monitor for available disk space.  In "maturity" we only need enough free space to handle our index operations.  When you start talking about 1 terabyte of data.... that means 500GB of free space, which is expensive.  If you cannot allocate it.... then just allow auto-grow and monitor the database.... but always plan for it from a volume size perspective.

For transaction log sizing - we don't have any hard rules.  A good rule of thumb for the OpsDB is ~20% to 50% of the database size.... this all depends on your environment.  For the Data Warehouse, it depends on how large the warehouse is - but you will probably find steady state to require somewhere around 10% of the warehouse size or less.  When we are doing additional grooming after an alert/event/perf storm.... or changing grooming from 400 days to 300 days - this will require a LOT more transaction log space - so keep that in mind as your databases grow.

     

     

     

Summary (or TL;DR):

     image

     

    1.  Set up a nightly Reindex job on your SCOM Operations Database for best performance and to reduce significant wasted space on disk.

    2.  You can do the same for the DW, but be prepared to put in the work to analyze the benefits if you do.  Running a regular (multiple times a day) Update Statistics has proven helpful to some customers.

    3.  Keep your DB recovery model in SIMPLE mode, unless you are using AlwaysOn replication.

4.  Ensure you pre-size your databases and logs so they are not always auto-growing, and have plenty of free space, as required for supportability.

    Reinstalling your SCOM agents with the NOAPM switch


     

    This one comes from collaboration with my colleague Brian Barrington.

    Because of the issues with SCOM 2016 and the default APM modules impacting IIS and SharePoint servers…..  (Read more about that issue HERE, HERE, and HERE)

     

    Brian was looking for a way to easily remove the APM components from the deployed agents with minimal impact.

    Normally, the guidance would be to uninstall the SCOM agent, then reinstall it from a command line installation using the NOAPM=1 command line parameter.  That could be a challenging task if you have hundreds or thousands of agents!

     

    His idea?  Use my SCOM Agent Tasks MP here:  https://blogs.technet.microsoft.com/kevinholman/2017/05/09/agent-management-pack-making-a-scom-admins-life-a-little-easier/

     

    It has a class property in the state view called “APM Installed” to help you see which agents still have the APM components installed (which are installed by default)

    image

     

    It has a task called “execute any PowerShell” 

    In the task – Override to provide the command you want to run – such as:

    msiexec.exe /fvomus "\\server\share\agents\scom2016\x64\MOMagent.msi" NOAPM=1

    You just need to place the MOMAgent.msi file on a share that your domain computer accounts would have access to.

    image

     

This performs a lightweight repair/reinstall of the agent that only changes the “NOAPM=1” switch, which leaves all other settings alone and removes only the APM service and components!

    We have gotten good feedback on the success of this process across hundreds of agents in a short time frame.
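
If the SCOM task route doesn’t fit, the same repair can be pushed over PowerShell remoting – a minimal sketch, where the agent list and share path are examples (note the remote session will need rights to the share, or copy the MSI locally first to avoid the double-hop problem):

# Run the msiexec repair with NOAPM=1 against a list of agent computers
$Agents = Get-Content 'C:\Temp\AgentList.txt'
Invoke-Command -ComputerName $Agents -ScriptBlock {
    Start-Process msiexec.exe -Wait -ArgumentList '/fvomus "\\server\share\agents\scom2016\x64\MOMagent.msi" NOAPM=1 /qn'
}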

     

    Removing the APM MP’s

On another note – if you have no plans to use the APM feature in SCOM – you should consider removing those MP’s, which get imported by default.  By default, they discover a LOT of instances of sites, services, and classes on the agents where APM components are installed.

    MP’s to remove in SCOM 2016:

    • Microsoft.SystemCenter.DataWarehouse.ApmReports.Library (Operations Manager APM Reports Library)
    • Microsoft.SystemCenter.Apm.Web  (Operations Manager APM Web)
    • Microsoft.SystemCenter.Apm.Wcf  (Operations Manager APM WCF Library)
    • Microsoft.SystemCenter.Apm.NTServices  (Operations Manager APM Windows Services)
    • Microsoft.SystemCenter.Apm.Infrastructure.Monitoring  (Operations Manager APM Infrastructure Monitoring)
    • Microsoft.SystemCenter.Apm.Library (Operations Manager APM Library)
    • Microsoft.SystemCenter.Apm.Infrastructure (Operations Manager APM Infrastructure)

    All of the above can be deleted.  However – in order to delete the Microsoft.SystemCenter.Apm.Infrastructure MP, you will need to remove a RunAs account profile association, then clean up the SecureReference library manually by deleting the reference.

In the Admin pane > Run As Configuration > Profiles, open the Data Warehouse Account.  On the RunAs Accounts tab, you will need to remove the Operations Manager APM Data Transfer Service association:

    image

    Then – manually export the Microsoft.SystemCenter.SecureReferenceOverride MP, and edit it using your favorite XML editor.  (Make a Backup copy of this FIRST!!!!!)

    Delete the reference to the Microsoft.SystemCenter.Apm.Infrastructure MP.

    image

     

    Save this, then reimport the Microsoft.SystemCenter.SecureReferenceOverride MP.

    At this point you can delete the final APM MP – Microsoft.SystemCenter.Apm.Infrastructure (Operations Manager APM Infrastructure)

     

Is deleting that MP with manual edits too scary for you?

    At a bare minimum – if you are not using the APM feature – you should disable the discoveries:

    image

     

    Then run Remove-SCOMDisabledClassInstance in your SCOM Command Shell, which will remove all these discovered instances that you don’t use.
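
Here is a sketch of those same steps from the Operations Manager Shell – the display name filter and the override MP are assumptions you would adjust for your environment:

# Disable the APM discoveries into an unsealed override MP, then groom out the discovered instances
$OverrideMP = Get-SCOMManagementPack -DisplayName 'My APM Overrides' | Where-Object { -not $_.Sealed }
Get-SCOMDiscovery | Where-Object { $_.DisplayName -like '*APM*' } | Disable-SCOMDiscovery -ManagementPack $OverrideMP -Enforce
Remove-SCOMDisabledClassInstance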

    Updated SQL RunAs Addendum Configuration MPs


     

    imageimage

     

    Just a quick note to let you know I updated these MP’s if you use them:

    https://blogs.technet.microsoft.com/kevinholman/2016/08/25/sql-mp-run-as-accounts-no-longer-required-2/

     

    Updates include:

1.  Disabled the Monitor for the SysAdmin check by default – you will need to enable this if you want to use it.  I have been recommending low priv since that is more secure, so this monitor is now disabled by default.

2.  Updated the Low Priv tasks for SQL 2005 – 2016 and made the tasks more reliable.  If the task encounters an error, it will not complete all the steps for configuring low priv, so it is important to review the output when you configure your SQL servers the first time.  Changes were made to skip databases in read-only mode in all versions, and to be more reliable for SQL 2005 and SQL 2008.

    3.  Updated version to 7.7.31.0 to align with current shipping SQL MP’s.


    Creating a SCOM Service Monitor that allows overrides for Interval Frequency and Samples


     

    image

     

The “built in” service monitor in SCOM is hard-coded for how often it checks the service state, and how many service checks have to return “not running” before it alarms.  This is a bit unfortunate, as customers often want to customize this.  This article will explain how.

     

    All the built in service monitoring uses Monitors that reference the Microsoft.Windows.CheckNTServiceStateMonitorType monitortype, which is in the Microsoft.Windows.Library mp.

    This MonitorType has a hard coded definition with <Frequency>30</Frequency> and <MatchCount>2</MatchCount>.  This means by default, monitors that use this will inspect the service state every 30 seconds, and alarm when it is not running after two consecutive checks.  However – the challenge is – Microsoft did not expose these values as override-able parameters.

What if you want to check the service every 60 seconds, and alarm only after it has been consistently down for 15 samples (15 consecutive minutes)?  We can do that.  We have the tools.

     

    Basically – we need to create our own MonitorType –which will expose these.  Here is an example:

<UnitMonitorType ID="Contoso.Demo.Service.MonitorType" Accessibility="Internal">
  <MonitorTypeStates>
    <MonitorTypeState ID="Running" NoDetection="false" />
    <MonitorTypeState ID="NotRunning" NoDetection="false" />
  </MonitorTypeStates>
  <Configuration>
    <xsd:element name="ComputerName" type="xsd:string" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
    <xsd:element name="ServiceName" type="xsd:string" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
    <xsd:element name="IntervalSeconds" type="xsd:integer" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
    <xsd:element name="CheckStartupType" minOccurs="0" maxOccurs="1" type="xsd:string" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
    <xsd:element name="Samples" type="xsd:integer" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
  </Configuration>
  <OverrideableParameters>
    <OverrideableParameter ID="IntervalSeconds" Selector="$Config/IntervalSeconds$" ParameterType="int" />
    <OverrideableParameter ID="CheckStartupType" Selector="$Config/CheckStartupType$" ParameterType="string" />
    <OverrideableParameter ID="Samples" Selector="$Config/Samples$" ParameterType="int" />
  </OverrideableParameters>
  <MonitorImplementation>
    <MemberModules>
      <DataSource ID="DS" TypeID="Windows!Microsoft.Windows.Win32ServiceInformationProvider">
        <ComputerName>$Config/ComputerName$</ComputerName>
        <ServiceName>$Config/ServiceName$</ServiceName>
        <Frequency>$Config/IntervalSeconds$</Frequency>
        <DisableCaching>true</DisableCaching>
        <CheckStartupType>$Config/CheckStartupType$</CheckStartupType>
      </DataSource>
      <ProbeAction ID="Probe" TypeID="Windows!Microsoft.Windows.Win32ServiceInformationProbe">
        <ComputerName>$Config/ComputerName$</ComputerName>
        <ServiceName>$Config/ServiceName$</ServiceName>
      </ProbeAction>
      <ConditionDetection ID="ServiceRunning" TypeID="System!System.ExpressionFilter">
        <Expression>
          <Or>
            <Expression>
              <And>
                <Expression>
                  <SimpleExpression>
                    <ValueExpression>
                      <Value Type="String">$Config/CheckStartupType$</Value>
                    </ValueExpression>
                    <Operator>NotEqual</Operator>
                    <ValueExpression>
                      <Value Type="String">false</Value>
                    </ValueExpression>
                  </SimpleExpression>
                </Expression>
                <Expression>
                  <SimpleExpression>
                    <ValueExpression>
                      <XPathQuery Type="Integer">Property[@Name='StartMode']</XPathQuery>
                    </ValueExpression>
                    <Operator>NotEqual</Operator>
                    <ValueExpression>
                      <Value Type="Integer">2</Value>
                      <!-- 0=BootStart 1=SystemStart 2=Automatic 3=Manual 4=Disabled -->
                    </ValueExpression>
                  </SimpleExpression>
                </Expression>
              </And>
            </Expression>
            <Expression>
              <SimpleExpression>
                <ValueExpression>
                  <XPathQuery Type="Integer">Property[@Name='State']</XPathQuery>
                </ValueExpression>
                <Operator>Equal</Operator>
                <ValueExpression>
                  <Value Type="Integer">4</Value>
                  <!-- 0=Unknown 1=Stopped 2=StartPending 3=StopPending 4=Running 5=ContinuePending 6=PausePending 7=Paused 8=ServiceNotFound 9=ServerNotFound -->
                </ValueExpression>
              </SimpleExpression>
            </Expression>
          </Or>
        </Expression>
      </ConditionDetection>
      <ConditionDetection ID="ServiceNotRunning" TypeID="System!System.ExpressionFilter">
        <Expression>
          <And>
            <Expression>
              <Or>
                <Expression>
                  <SimpleExpression>
                    <ValueExpression>
                      <XPathQuery Type="Integer">Property[@Name='StartMode']</XPathQuery>
                    </ValueExpression>
                    <Operator>Equal</Operator>
                    <ValueExpression>
                      <Value Type="Integer">2</Value>
                      <!-- 0=BootStart 1=SystemStart 2=Automatic 3=Manual 4=Disabled -->
                    </ValueExpression>
                  </SimpleExpression>
                </Expression>
                <Expression>
                  <And>
                    <Expression>
                      <SimpleExpression>
                        <ValueExpression>
                          <Value Type="String">$Config/CheckStartupType$</Value>
                        </ValueExpression>
                        <Operator>Equal</Operator>
                        <ValueExpression>
                          <Value Type="String">false</Value>
                        </ValueExpression>
                      </SimpleExpression>
                    </Expression>
                    <Expression>
                      <SimpleExpression>
                        <ValueExpression>
                          <XPathQuery Type="Integer">Property[@Name='StartMode']</XPathQuery>
                        </ValueExpression>
                        <Operator>NotEqual</Operator>
                        <ValueExpression>
                          <Value Type="Integer">2</Value>
                          <!-- 0=BootStart 1=SystemStart 2=Automatic 3=Manual 4=Disabled -->
                        </ValueExpression>
                      </SimpleExpression>
                    </Expression>
                  </And>
                </Expression>
              </Or>
            </Expression>
            <Expression>
              <SimpleExpression>
                <ValueExpression>
                  <XPathQuery Type="Integer">Property[@Name='State']</XPathQuery>
                </ValueExpression>
                <Operator>NotEqual</Operator>
                <ValueExpression>
                  <Value Type="Integer">4</Value>
                  <!-- 0=Unknown 1=Stopped 2=StartPending 3=StopPending 4=Running 5=ContinuePending 6=PausePending 7=Paused 8=ServiceNotFound 9=ServerNotFound -->
                </ValueExpression>
              </SimpleExpression>
            </Expression>
          </And>
        </Expression>
        <SuppressionSettings>
          <MatchCount>$Config/Samples$</MatchCount>
        </SuppressionSettings>
      </ConditionDetection>
    </MemberModules>
    <RegularDetections>
      <RegularDetection MonitorTypeStateID="Running">
        <Node ID="ServiceRunning">
          <Node ID="DS" />
        </Node>
      </RegularDetection>
      <RegularDetection MonitorTypeStateID="NotRunning">
        <Node ID="ServiceNotRunning">
          <Node ID="DS" />
        </Node>
      </RegularDetection>
    </RegularDetections>
    <OnDemandDetections>
      <OnDemandDetection MonitorTypeStateID="Running">
        <Node ID="ServiceRunning">
          <Node ID="Probe" />
        </Node>
      </OnDemandDetection>
      <OnDemandDetection MonitorTypeStateID="NotRunning">
        <Node ID="ServiceNotRunning">
          <Node ID="Probe" />
        </Node>
      </OnDemandDetection>
    </OnDemandDetections>
  </MonitorImplementation>
</UnitMonitorType>

     

    Essentially – we have taken the hard-coded values, and changed them to allow a $Config/Value$ passed parameter.  This will allow the monitor to PASS this value to the MonitorType, and be used in the DataSource or ConditionDetection.  Even if you don’t fully understand that, it’s ok…. because I will be wrapping all this up in a consumable VSAE Fragment that is easy to implement.

    The changes made to allow data to be passed in were:

              <Frequency>$Config/IntervalSeconds$</Frequency>
              <MatchCount>$Config/Samples$</MatchCount>

    In the <Configuration> section we added:

              <xsd:element name="IntervalSeconds" type="xsd:integer" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />
              <xsd:element name="Samples" type="xsd:integer" xmlns:xsd="http://www.w3.org/2001/XMLSchema" />

    In the <OverrideableParameters> section – we added:

              <OverrideableParameter ID="IntervalSeconds" Selector="$Config/IntervalSeconds$" ParameterType="int" />
              <OverrideableParameter ID="Samples" Selector="$Config/Samples$" ParameterType="int" />

In the DataSource, one new value should be added when using Microsoft.Windows.Win32ServiceInformationProvider with multiple samples:

               <DisableCaching>true</DisableCaching>

    This is very important, as this will cause the datasource to output data every time, even if nothing has changed.  We need this for the number of samples (MatchCount) to work as desired.

    Now that we have this new MonitorType – we can reference it in our own Monitors.  Here is an example of a Monitor using this:

<UnitMonitor ID="Contoso.Demo.Spooler.Service.Monitor" Accessibility="Public" Enabled="true" Target="Windows!Microsoft.Windows.Server.OperatingSystem" ParentMonitorID="Health!System.Health.AvailabilityState" Remotable="true" Priority="Normal" TypeID="Contoso.Demo.Service.MonitorType" ConfirmDelivery="false">
  <Category>AvailabilityHealth</Category>
  <AlertSettings AlertMessage="Contoso.Demo.Spooler.Service.Monitor.Alert.Message">
    <AlertOnState>Error</AlertOnState>
    <AutoResolve>true</AutoResolve>
    <AlertPriority>Normal</AlertPriority>
    <AlertSeverity>Error</AlertSeverity>
    <AlertParameters>
      <AlertParameter1>$Data/Context/Property[@Name='Name']$</AlertParameter1>
      <AlertParameter2>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/PrincipalName$</AlertParameter2>
    </AlertParameters>
  </AlertSettings>
  <OperationalStates>
    <OperationalState ID="Running" MonitorTypeStateID="Running" HealthState="Success" />
    <OperationalState ID="NotRunning" MonitorTypeStateID="NotRunning" HealthState="Error" />
  </OperationalStates>
  <Configuration>
    <ComputerName />
    <ServiceName>spooler</ServiceName>
    <IntervalSeconds>30</IntervalSeconds>
    <CheckStartupType>true</CheckStartupType>
    <Samples>2</Samples>
  </Configuration>
</UnitMonitor>

     

    Once you implement this Monitor – you will see the new options exposed in overrides:

    image

     

     

    So the key takeaways are:

    • The built in service monitoring does not allow for configurable Interval and Sample count.
    • We can customize this using a custom MonitorType that allows for these variables to be passed in.
    • Using the Microsoft.Windows.Win32ServiceInformationProvider we MUST set <DisableCaching>true</DisableCaching>

     

     

    This example has been added to my Fragment Library for you to download at:

    https://gallery.technet.microsoft.com/SCOM-Management-Pack-VSAE-2c506737

    (see:  Monitor.Service.WithAlert.FreqAndSamples.mpx)

     

    To learn more about using MP Fragments, and how EASY they are to use with Visual Studio:

    https://blogs.technet.microsoft.com/kevinholman/2016/06/04/authoring-management-packs-the-fast-and-easy-way-using-visual-studio/

    https://www.youtube.com/watch?v=9CpUrT983Gc

     

    To make using fragments REALLY EASY, using Silect MP Author Pro, watch the video:

    https://blogs.technet.microsoft.com/kevinholman/2017/03/22/management-pack-authoring-the-really-fast-and-easy-way-using-silect-mp-author-and-fragments/

    https://www.youtube.com/watch?v=E5nnuvPikFw

     

     

image

    QuickTip: Disabling workflows to optimize for large environments


     

    image

     

    One of the coolest things about SCOM is how much monitoring you get out of the box.

    That said, one of the biggest performance impacts to SCOM is all the monitoring out of the box, plus all the Management Packs you import.  This has a cumulative effect, and over time, can impact the speed of the console, because of all the activity happening.

I have long stated that the biggest performance relief you can give to SCOM is to reduce the number of workflows, reduce the classes and relationships, and keep things simple.

SCOM 2007 shipped back in March 2007.  In 10 years, we have continuously added management packs to a default installation of SCOM, and continuously added workflows to the existing MP’s.

    For the most part – this is good.  These packs add more and more monitoring and capabilities “out of the box”.  However, in many cases, they can also add load to the environment.  They discover class instances, relationships, add state calculation, etc.  In small SCOM environments (under 1000 agents) this will have very little impact.  But at large enterprise scale, every little thing counts.

     

    I have already written about some of the optional things you can consider (IF you don’t use the features), such as removing the APM MP’s, and removing the Advisor MP’s.

     

    Here is one I came across today with a customer:

     

    I noticed on the server that hosts the “All Management Servers Resource Pool” we have some out of the box PowerShell script based rules that were timing out after 300 seconds, and running every 15 minutes:

    Collect Agent Health States (ManagementGroupCollectionAgentHealthStatesRule)

    Collect Management Group Active Alerts Count (ManagementGroupCollectionAlertsCountRule)


    image

     

    These scripts do things like “Get-SCOMAgent” and “Get-SCOMAlert”.  They were timing out, running constantly for 5 minutes, then getting killed by the timeout limit, then starting over again.  This kind of thing will have significant impact on SQL blocking, SDK utilization, and overall performance.

     

    Now, in small environments, this isn’t a big deal, and these will return results quickly with little impact.  However, in a VERY large environment, Get-SCOMAgent can take 10 minutes or more just to return the data!!!!  If you have hundreds of thousands of open alerts, it can take just as long to run the Alert SDK queries as well.

    The only thing these two rules are used for is to populate a SCOM Health dashboard – and these are of little value:

     

    image

     

I recommend that larger environments disable these two rules….. as they will be very resource intensive for very minimal value.  If you would like to keep them, then override them to run every 86400 seconds, set the timeout to 600 seconds, and set a sync time so each runs off peak, like 23:00 (11pm).  Stagger the sync time for the other rule to begin at 23:20 (11:20pm) so they aren’t both running at the same time.  If a rule cannot complete in 10 minutes, then disable it.

     

    image

     

    Additionally, in this same MP (Microsoft.SystemCenter.OperationsManager.SummaryDashboard) there are two discoveries.

    Collect Agent Versions (ManagementGroupDiscoveryAgentVersions)

    Collect agent configurations (ManagementGroupDiscoveryAgentConfiguration)

    These discoveries run once per hour, and also run things like Get-SCOMAgent – which is bad for large environments, especially with that frequency.

    The only thing they do is populate this dashboard:

     

    image

     

    I rarely ever see this being used and recommend large environments disable these as well. 

    image
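
To see all four workflows from this MP in one pass before you override them, a quick sketch from the Operations Manager Shell:

# List the rules and discoveries shipped in the SummaryDashboard MP
$MP = Get-SCOMManagementPack -Name 'Microsoft.SystemCenter.OperationsManager.SummaryDashboard'
Get-SCOMRule -ManagementPack $MP | Select-Object Name, Enabled
Get-SCOMDiscovery -ManagementPack $MP | Select-Object Name, Enabled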

     

    Speed up that SCOM deployment!

    image

    Adding direct agent OMS Workspace and Proxy via PowerShell


     

    image

    We have some REALLY good documentation on this here:  https://docs.microsoft.com/en-us/azure/log-analytics/log-analytics-windows-agents

     

    PowerShell and the Agent Scripting Objects make it really easy to control the OMS direct agent configuration on thousands of agents, using SCOM.

     

    Here are some PowerShell examples:

    # Load agent scripting object
      $AgentCfg = New-Object -ComObject AgentConfigManager.MgmtSvcCfg

    # Get all AgentCfg methods and properties
    $AgentCfg | Get-Member

    # Check to see if this agent supports OMS
      $AgentSupportsOMS = $AgentCfg | Get-Member -Name 'GetCloudWorkspaces'

    # Get all configured OMS Workspaces
      $AgentCfg.GetCloudWorkspaces()

    # Add OMS Workspace
      $AgentCfg.AddCloudWorkspace($WorkspaceID,$WorkspaceKey)

    # Remove OMS Workspace
      $AgentCfg.RemoveCloudWorkspace($WorkspaceID)

    # Get the OMS proxy if configured
      $AgentCfg.proxyUrl

    # Set a proxy for the OMS Agent
      $AgentCfg.SetProxyUrl($ProxyURL)
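
Putting those pieces together, a minimal end-to-end sketch for one agent – the workspace ID, key, and proxy URL are placeholders:

# Add a workspace (and optional proxy) on the local agent, then reload its configuration
$WorkspaceID  = '<your workspace ID GUID>'
$WorkspaceKey = '<your workspace primary key>'
$AgentCfg = New-Object -ComObject AgentConfigManager.MgmtSvcCfg
IF ($AgentCfg | Get-Member -Name 'GetCloudWorkspaces')
{
  #$AgentCfg.SetProxyUrl('http://proxy.opsmgr.net:8080')   # uncomment if a proxy is required
  $AgentCfg.AddCloudWorkspace($WorkspaceID,$WorkspaceKey)
  $AgentCfg.ReloadConfiguration()
}
ELSE
{
  Write-Output 'This agent version does not support OMS direct connection.'
}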


    I added these tasks to ADD and REMOVE OMS workspaces from the MMA, in the latest version of the SCOM Management helper pack:

    https://blogs.technet.microsoft.com/kevinholman/2017/05/09/agent-management-pack-making-a-scom-admins-life-a-little-easier/

    image

    Monitoring for Time Drift in your enterprise


     

    image

     

    Time sync is critical in today’s networks.  Experiencing time drift across devices can cause authentication breakdowns, reporting miscalculations, and wreak havoc on interconnected systems.  This article shows a demo management pack to monitor for time sync across your Windows devices.

The basic idea is to monitor all systems and compare their local time against a target reference time server, using W32Time.  Here is the core command from the PowerShell script:

    $cmd = w32tm /stripchart /computer:$RefServer /dataonly /samples:$Samples

    The script will take two parameters, the reference server and the threshold for how much time drift is allowed.

    Here is the PowerShell script:

#=================================================================================
# Time Skew Monitoring Script
# Kevin Holman
# Version 1.0
#=================================================================================
param([string]$RefServer,[int]$Threshold)

# Constants section - modify stuff here:
# Assign script name variable for use in event logging
$ScriptName = "Demo.TimeDrift.PA.ps1"
# Set samples to the number of w32time samples you wish to include
[int]$Samples = '1'
# For testing - assign values instead of parameters to the script
#[string]$RefServer = 'dc1.opsmgr.net'
#[int]$Threshold = '10'
#=================================================================================

# Gather script start time
$StartTime = Get-Date
# Gather who the script is running as
$WhoAmI = whoami
# Load MomScript API and PropertyBag function
$momapi = new-object -comObject 'MOM.ScriptAPI'
$bag = $momapi.CreatePropertyBag()
# Log script event that we are starting task
$momapi.LogScriptEvent($ScriptName,9250,0, "Starting script")

# Start MAIN body of script:
# Getting the required data
$cmd = w32tm /stripchart /computer:$RefServer /dataonly /samples:$Samples
IF ($cmd -match 'error')
{
  # Log error and quit
  $momapi.LogScriptEvent($ScriptName,9250,2, "Getting TimeDrift from Reference Server returned an error.  Reference server is ($RefServer). Output of command is ($cmd)")
  exit
}
ELSE
{
  # Assume we got good results from cmd
  $Skew = $cmd[-1..($Samples * -1)] | ConvertFrom-Csv -Header "Time","Skew" | Select -ExpandProperty Skew
  $Result = $Skew | % { $_ -replace "s","" } | Measure-Object -Average | select -ExpandProperty Average
}

# The problem is that you can have time skew in two directions: positive or negative.  You can do two
# things: create an IF statement that checks both, or just create a positive number.
IF ($Result -lt 0)
{
  $Result = $Result * -1
}
$TimeDriftSeconds = [math]::Round($Result,2)

# Determine if the average time skew is higher than your threshold and report this back to SCOM.
IF ($TimeDriftSeconds -gt $Threshold)
{
  $bag.AddValue("TimeSkew","True")
  $momapi.LogScriptEvent($ScriptName,9250,2, "Time Drift was detected. Reference server is ($RefServer). Threshold is ($Threshold) seconds. Value is ($TimeDriftSeconds) seconds")
}
ELSE
{
  $bag.AddValue("TimeSkew","False")
  # Log good event for testing
  #$momapi.LogScriptEvent($ScriptName,9250,0, "Time Drift was OK. Reference server is ($RefServer). Threshold is ($Threshold) seconds. Value is ($TimeDriftSeconds) seconds")
}

# Add stuff into the propertybag
$bag.AddValue("RefServer",$RefServer)
$bag.AddValue("Threshold",$Threshold)
$bag.AddValue("TimeDriftSeconds",$TimeDriftSeconds)

# Log an event for script ending and total execution time.
$EndTime = Get-Date
$ScriptTime = ($EndTime - $StartTime).TotalSeconds
$ScriptTime = [math]::Round($ScriptTime,2)
$momapi.LogScriptEvent($ScriptName,9250,0,"`n Script has completed. `n Reference server is ($RefServer). `n Threshold is ($Threshold) seconds. `n Value is ($TimeDriftSeconds) seconds. `n Runtime was ($ScriptTime) seconds.")

# Output the propertybag
$bag

     

    Next, we will put the script into a Probe action, which will be called by a Datasource with a scheduler.  The reason we want to break this out, is because we want to “share” this datasource between a monitor and rule.  The monitor will monitor for the time skew, while the rule will collect the skew as a perf counter, so we can monitor for trends in the environment.

     

    So the key components of the MP are the DS, the PA (containing the script), the MonitorType and the Monitor, the Perf collection rule, and some views to show this off:

     

    image

     

    When a threshold is breached, the monitor raises an alert:

    image

     

    The performance view will show you the trending across your systems:

    image

     

    On the monitor (and rule) you can modify the reference server:

    image

     

One VERY IMPORTANT concept – if you change anything, you must make identical overrides on BOTH the monitor and the rule, otherwise you will break cookdown and the script will run twice for each interval.  So be sure to set the IntervalSeconds, RefServer, and Threshold the same on both the monitor and the rule.  If you want the monitor to run much more frequently than the default of once an hour, that’s fine, but you might not want the perf data collected more than once per hour.  While that will break cookdown, it only breaks once per hour, which is probably less of an impact than overcollecting performance data.

    From here, you could add in a recovery to force a resync of w32time if you wanted, or add in additional alert rules for w32time events.

     

    The example MP is available here:

    https://gallery.technet.microsoft.com/SCOM-Management-Pack-to-bca30237

    How to create a SCOM group from an Active Directory Computer Group


     

    imageimage

     

    There have been a bunch of examples of this published over the years.  Some of them worked well, but I was never happy with many of them as they were often vbscript based, hard to troubleshoot, and required lots of editing each time you wanted to reuse them.  Many were often error prone, and didn’t work if the AD group contained computers that didn’t exist in SCOM, as SCOM will reject the entire discovery data payload in that case.

    If you too were looking for a reliable and easy way to do this, well, look no further!  I have created an MP Fragment in my fragment library for this: 

    https://gallery.technet.microsoft.com/SCOM-Management-Pack-VSAE-2c506737

     

    This MP Fragment will make creating SCOM groups of Windows Computers from Active Directory groups super easy!  This is a nice way to “delegate” the ability for end users to control what servers will appear in their scopes, as they often have the ability to easily add and remove computers from their AD groups, but they do not have access to SCOM Group memberships.

    I am going to demonstrate using Silect MP Author Pro to reuse this Fragment, and you can also easily use Visual Studio with VSAE.  If you’d like to read more on either of those, see:

    https://blogs.technet.microsoft.com/kevinholman/2016/06/04/authoring-management-packs-the-fast-and-easy-way-using-visual-studio/

    https://blogs.technet.microsoft.com/kevinholman/2017/03/22/management-pack-authoring-the-really-fast-and-easy-way-using-silect-mp-author-and-fragments/

     

    In Silect MP Author Pro – create a new, empty management pack, and select “Import Fragment”

     

    image

     

    Browse the fragment and choose:  Class.Group.ADGroupWindowsComputers.mpx

    image

     

    We need to simply input the values here, such as:

    image

     

Click “Import”.

    Silect MP Author Pro will automagically handle the references for you, so just say “Yes” on the popup:

    image

     

That’s IT!

     

    Save it, and deploy it!

    image

     

    If you look in SCOM after a few minutes – you should see your group:

     

    image

     

The rule to populate it runs once a day by default, but it will run immediately upon import.  Look for event ID 7500 in the OpsMgr event log on the Management Server that hosts your All Management Servers Resource Pool object.

    image
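You can also pull those events with PowerShell on that management server; a quick sketch:

# Run on the management server hosting the All Management Servers Resource Pool
Get-WinEvent -FilterHashtable @{ LogName = 'Operations Manager'; Id = 7500 } -MaxEvents 10 |
    Select-Object TimeCreated, LevelDisplayName, Message | Format-List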

     

    Once you see these events and no errors in them – you can view group membership in SCOM:

     

    image
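You can also check the membership from the Operations Manager Command Shell; a minimal sketch (the group display name is an example):

Import-Module OperationsManager
New-DefaultManagementGroupConnection "localhost"

# Get the group and list the Windows Computer objects it contains (example display name)
$Group = Get-SCOMGroup -DisplayName "Demo MyApp Computers Group"
$Group.GetRelatedMonitoringObjects() | Select-Object DisplayName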

     

    So easy.  And you don’t have to know anything about XML, or even Management Packs to do it!

    Using Visual Studio with VSAE works exactly the same way – you simply have to do a manual Find/Replace for each item.  See the VSAE method in the link above.
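If you do a lot of these, the Find/Replace itself can be scripted.  A rough sketch that stamps out the ##TOKEN## placeholders in a copy of the fragment (the replacement values are examples, and your fragment may declare additional tokens, such as the AD group name and LDAP search path):

# Stamp out the fragment tokens before adding the .mpx to your VSAE project (example values)
$Fragment = Get-Content ".\Class.Group.ADGroupWindowsComputers.mpx" -Raw
$Fragment = $Fragment -replace '##CompanyID##','Demo' `
                      -replace '##AppName##','MyApp' `
                      -replace '##GroupNameNoSpaces##','WebServers'
Set-Content -Path ".\Demo.MyApp.WebServers.Group.mpx" -Value $Fragment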

     

    Want to dig deeper into how this is put together?  Read on:

The MP we generate is very basic.  There is a Class (the Group definition), a Relationship (the Group contains Windows Computers), and a Discovery (which queries AD and discovers the relationship to the existing Windows Computers in SCOM).

    image

     

    The script is below:

We basically connect to AD, find the group by name, query it to get the members, look each member up to see if it exists in SCOM, and if it does, add it to the group.

    We will log events along the way to help in troubleshooting if anything doesn’t work, and record the completion and total script runtime, like all my SCOM scripts.

#=================================================================================
#  Group Population script based on AD group membership
#
#  Kevin Holman
#  v1.2
#=================================================================================
param($SourceID, $ManagedEntityID, $ADGroup, $LDAPSearchPath)

# Manual Testing section - put stuff here for manually testing script - typically parameters:
#=================================================================================
# $SourceId = '{00000000-0000-0000-0000-000000000000}'
# $ManagedEntityId = '{00000000-0000-0000-0000-000000000000}'
# $ADGroup = "SCOM Computers Group"
# $LDAPSearchPath = "LDAP://DC=opsmgr,DC=net"
#=================================================================================

# Constants section - modify stuff here:
#=================================================================================
# Assign script name variable for use in event logging
$ScriptName = "##CompanyID##.##AppName##.##GroupNameNoSpaces##.Group.Discovery.ps1"
$EventID = "7500"
#=================================================================================

# Starting Script section - All scripts get this
#=================================================================================
# Gather the start time of the script
$StartTime = Get-Date
# Load MOMScript API
$momapi = New-Object -comObject MOM.ScriptAPI
# Load SCOM Discovery module
$DiscoveryData = $momapi.CreateDiscoveryData(0, $SourceId, $ManagedEntityId)
# Set variables to be used in logging events
$whoami = whoami
# Log script event that we are starting task
$momapi.LogScriptEvent($ScriptName,$EventID,0,"`n Script is starting. `n Running as ($whoami).")
#=================================================================================

# Connect to local SCOM Management Group Section
#=================================================================================
# Clear any previous errors
$Error.Clear()
# Import the OperationsManager module and connect to the management group
$SCOMPowerShellKey = "HKLM:\SOFTWARE\Microsoft\System Center Operations Manager\12\Setup\Powershell\V2"
$SCOMModulePath = Join-Path (Get-ItemProperty $SCOMPowerShellKey).InstallDirectory "OperationsManager"
Import-Module $SCOMModulePath
New-DefaultManagementGroupConnection "localhost"
IF ($Error)
{
  $momapi.LogScriptEvent($ScriptName,$EventID,1,"`n FATAL ERROR: Failure loading OperationsManager module or unable to connect to the management server. `n Terminating script. `n Error is: ($Error).")
  EXIT
}
#=================================================================================

# Begin MAIN script section
#=================================================================================
# Log event for captured parameters
$momapi.LogScriptEvent($ScriptName,$EventID,0,"`n ADGroup: ($ADGroup) `n LDAP search path: ($LDAPSearchPath).")

# Connect to AD using LDAP search to find the DN for the Group
$Searcher = New-Object DirectoryServices.DirectorySearcher
$Searcher.Filter = '(&(objectCategory=group)(cn=' + $ADGroup + '))'
$Searcher.SearchRoot = $LDAPSearchPath
$Group = $Searcher.FindAll()
$GroupDN = @()
# Now that we have the group object, trim to get the DN in order to search for members
$GroupDN = $Group.path.TrimStart("LDAP://")

# If we found the group in AD by the DisplayName log a success event, otherwise log an error
IF ($GroupDN)
{
  $momapi.LogScriptEvent($ScriptName,$EventID,0,"`n Successfully found group in AD: ($GroupDN).")
}
ELSE
{
  $momapi.LogScriptEvent($ScriptName,$EventID,1,"`n FATAL ERROR: Did not find group in AD: ($ADGroup) using ($LDAPSearchPath). `n Terminating script.")
  EXIT
}

# Search for members of the group
$Searcher.Filter = '(&(objectCategory=computer)(memberOf=' + $GroupDN + '))'
$ADComputerObjects = $Searcher.FindAll()
$ADComputerObjectsCount = $ADComputerObjects.Count
If ($ADComputerObjectsCount -gt 0)
{
  $momapi.LogScriptEvent($ScriptName,$EventID,0,"`n Successfully found ($ADComputerObjectsCount) members in the group: ($GroupDN).")
}
Else
{
  $momapi.LogScriptEvent($ScriptName,$EventID,1,"`n FATAL ERROR: Did not find any members in the AD group: ($GroupDN). `n Terminating script.")
  EXIT
}

# Set namelist array to empty
$namelist = @()
# Loop through each computer object and get an array of FQDN hostnames
FOREACH ($ADComputerObject in $ADComputerObjects)
{
  [string]$DNSComputerName = $ADComputerObject.Properties.dnshostname
  $namelist += $DNSComputerName
}

# Check SCOM and get back any matching computers
# This is necessary to filter the list for relationship discovery, because if we return any computers missing from SCOM the Management Server will reject the discovery
# We are using the namelist array of FQDNs passed to Get-SCOMClassInstance to only pull back matching systems from the SDK, as opposed to getting all Windows Computers and then parsing, which is assumed slower in large environments
$ComputersInSCOM = Get-SCOMClassInstance -Name $namelist
$ComputersInSCOMCount = $ComputersInSCOM.Count
# Logging event
$momapi.LogScriptEvent($ScriptName,$EventID,0,"`n Found ($ComputersInSCOMCount) matching computers in SCOM from the ($ADComputerObjectsCount) total computers in the AD group ($GroupDN).")

# Discovery Section
# Set the group instance we will discover members of
$GroupInstance = $DiscoveryData.CreateClassInstance("$MPElement[Name='##CompanyID##.##AppName##.##GroupNameNoSpaces##.Group']$")
# Loop through each SCOM computer and add a group membership containment relationship to the discovery data
FOREACH ($ComputerInSCOM in $ComputersInSCOM.DisplayName)
{
  $ServerInstance = $DiscoveryData.CreateClassInstance("$MPElement[Name='Windows!Microsoft.Windows.Computer']$")
  $ServerInstance.AddProperty("$MPElement[Name='Windows!Microsoft.Windows.Computer']/PrincipalName$", $ComputerInSCOM)
  $RelationshipInstance = $DiscoveryData.CreateRelationshipInstance("$MPElement[Name='##CompanyID##.##AppName##.##GroupNameNoSpaces##.Group.Contains.Windows.Computers']$")
  $RelationshipInstance.Source = $GroupInstance
  $RelationshipInstance.Target = $ServerInstance
  $DiscoveryData.AddInstance($RelationshipInstance)
}

# Return Discovery Items Normally
$DiscoveryData
# Return Discovery Bag to the command line for testing (does not work from ISE)
# $momapi.Return($DiscoveryData)
#=================================================================================
# End MAIN script section
# End of script section
#=================================================================================
# Log an event for script ending and total execution time
$EndTime = Get-Date
$ScriptTime = ($EndTime - $StartTime).TotalSeconds
$momapi.LogScriptEvent($ScriptName,$EventID,0,"`n Script Ending. `n Script Runtime: ($ScriptTime) seconds.")
#=================================================================================
# End Script

     

    Key recommendations:

1.  Don’t run your frequency <IntervalSeconds> too often.  If updating the group once a day is ok, leave it at the default.  If you need it more frequent, that’s fine; just remember it’s a script, and all scripts running on the management servers add to the overall load.  Each run also searches through AD and submits discovery data about the relationships.

2.  The default timeout is set to 5 minutes.  If the script cannot complete in less than that, something is WRONG, and the most likely culprit is how long it takes to find the group in AD.  If that is true for you, you need to optimize the LDAP search path in the section that queries AD (see the timing sketch after this list).

    3.  If you have a lot of AD based SCOM groups, consider adding a staggered sync time to each discovery, so they don’t all run at the same time, or on the same interval.
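If you suspect the AD lookup is what’s eating the time, you can test the same DirectorySearcher query the script runs, outside of SCOM.  A rough sketch, with an example group name and LDAP path (note how a narrower SearchRoot can speed up the search):

# Time the same LDAP group lookup the discovery script performs (example values)
$Elapsed = Measure-Command {
    $Searcher = New-Object DirectoryServices.DirectorySearcher
    $Searcher.Filter = '(&(objectCategory=group)(cn=SCOM Computers Group))'
    $Searcher.SearchRoot = 'LDAP://OU=Groups,DC=opsmgr,DC=net'   # narrower root = faster search
    $Group = $Searcher.FindAll()
}
"Group lookup took $($Elapsed.TotalSeconds) seconds"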
