Hi Foglight.ORGers.
In the first part of this post, I claimed that approaching monitoring from the point of view of ‘we need to monitor the methods’ has value, but isn’t in general the best way to reduce the Mean Time To Resolution (MTTR) of production problems.
So what’s the alternative approach? Well, this is the approach we follow in Foglight:
1. We realise that the most important thing is the service. This might be a single request, a type of request or something more complex. In addition to monitoring the containers, JVMs etc (I won’t talk about those here), our Java agent starts by discovering the requests and keeping track of their evolving performance health.
2. SLAs/OLAs can be defined against each request type. So you can say ‘let me know if a request of this type takes more than 10s’ or whatever you fancy (Foglight allows you to define very complex thresholds if you like, but that’s another conversation).
3. Once you get an alert on a particular request type, you can (automatically or on-demand) capture complete cross-VM call graphs for problematic requests of this request type. These can then be analysed.
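To make the three steps above concrete, here is a minimal Java sketch of the idea. It is purely illustrative and is not Foglight's API: the class `RequestThresholdSketch`, its methods and the request type names are all hypothetical, and the "capture" is reduced to flagging a request type for detailed tracing.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these names are Foglight APIs.
class RequestThresholdSketch {

    // Step 2: one response-time limit (the "SLA") per request type.
    private final Map<String, Duration> thresholds = new HashMap<>();

    // Step 3: request types flagged for call-graph capture after a breach.
    private final List<String> flaggedForCapture = new ArrayList<>();

    void defineThreshold(String requestType, Duration limit) {
        thresholds.put(requestType, limit);
    }

    // Step 1: record one observed request of a discovered type.
    // Returns true if the request breached the threshold for its type.
    boolean observe(String requestType, Duration elapsed) {
        Duration limit = thresholds.get(requestType);
        if (limit == null || elapsed.compareTo(limit) <= 0) {
            return false; // no threshold defined, or within the limit
        }
        // Breach: flag the type once so detailed tracing can be enabled.
        if (!flaggedForCapture.contains(requestType)) {
            flaggedForCapture.add(requestType);
        }
        return true;
    }

    // Defensive copy of the request types awaiting capture.
    List<String> pendingCaptures() {
        return new ArrayList<>(flaggedForCapture);
    }

    public static void main(String[] args) {
        RequestThresholdSketch monitor = new RequestThresholdSketch();
        monitor.defineThreshold("checkout", Duration.ofSeconds(10));

        monitor.observe("checkout", Duration.ofSeconds(3));   // within limit
        monitor.observe("checkout", Duration.ofSeconds(12));  // breach
        monitor.observe("logout", Duration.ofSeconds(60));    // no threshold set

        System.out.println("Capture call graphs for: " + monitor.pendingCaptures());
    }
}
```

The point of the design is that nothing expensive happens until a request type that someone cares about has actually breached its own limit. A slow request with no threshold defined (the `logout` call here) is simply ignored.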
So how does this approach avoid the issues I describe above?
1. You can decide how slow is ‘too slow’ on a per-request-type basis, so you don’t waste time/resources collecting and analysing slow method calls that turn out not to be associated with key requests, or that turn out to be the slowest method but aren’t slow enough to cause problems for the request that calls them.
2. You can see what’s calling a slow method/servlet/EJB. So you have context. Maybe the problem is really with the caller not the callee.
3. If faced with several performance issues at once, you can choose which to prioritise based on the requests with which those issues are associated (e.g. problems with people paying online trump, say, problems with a user logging out of the system).
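The third point, triaging by request rather than by method, can be sketched in a few lines of Java. Again this is an assumption-laden illustration rather than anything Foglight exposes: the `Alert` type, the `prioritise` method and the numeric priorities are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: not a Foglight API.
class AlertTriageSketch {

    // A performance issue, tagged with the request type it was seen under.
    static final class Alert {
        final String requestType;
        final String description;

        Alert(String requestType, String description) {
            this.requestType = requestType;
            this.description = description;
        }
    }

    // Orders alerts by the business priority of their request type.
    // Lower number = more important; unranked request types sort last.
    // List.sort is stable, so equal-priority alerts keep their input order.
    static List<Alert> prioritise(List<Alert> alerts,
                                  Map<String, Integer> priorityByRequestType) {
        List<Alert> sorted = new ArrayList<>(alerts);
        sorted.sort(Comparator.comparingInt((Alert a) ->
                priorityByRequestType.getOrDefault(a.requestType, Integer.MAX_VALUE)));
        return sorted;
    }

    public static void main(String[] args) {
        Map<String, Integer> priorities = new HashMap<>();
        priorities.put("payment", 1); // people paying online come first
        priorities.put("logout", 5);

        List<Alert> alerts = Arrays.asList(
                new Alert("logout", "slow servlet"),
                new Alert("payment", "slow JDBC call"));

        for (Alert a : prioritise(alerts, priorities)) {
            System.out.println(a.requestType + ": " + a.description);
        }
    }
}
```

Because each alert carries its request type, the ordering reflects what matters to the business rather than which method happens to have the largest raw timing.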
In summary, I believe our philosophy on JEE production performance monitoring means you are less likely to get distracted by unimportant performance issues, or to spend time analysing technical data without any context. Avoiding these issues reduces your MTTR and saves stress. Which is why we do it this way.
Of course, there are specific use cases for which it’s perfectly reasonable to want to start by monitoring named methods, but that’s not a general case and is for another conversation (and, yes, Foglight can do that too if you want). Personally, I think some of the vendors out there who come from more of a pre-production background blur this important distinction between pre-prod & prod performance management.
Thanks for reading this. As always, feel free to contact me directly at hugh.mcevoy@quest.com with comments/objections/suggestions.
Cheers,
Hugh