Better Monitoring and Analytics 

Effective fault monitoring is a requirement for any production-quality operation, especially since agreed-upon Service Level Agreements (SLAs) are meaningless without real metrics to back them up.

The EIS SLA guarantees 99.5% uptime for our production API Management services. We can demonstrate that we meet and exceed this target using the Nagios monitoring provided by our partners in the IS&T Unix Team. The following screenshot shows uptime for all servers near the end of 2014, both for October and for all of 2014, including development and QA servers (which also have excellent uptimes):

Uptime reported by Nagios November 2014
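To put the 99.5% figure in concrete terms, here is a minimal sketch of the downtime budget it implies. The function name and the choice of a 365-day year and a 30-day month are our own assumptions for illustration; the SLA itself may define its measurement period differently.

```python
def allowed_downtime_hours(uptime_pct: float, period_hours: float) -> float:
    """Hours of downtime permitted in a period at a given uptime percentage."""
    return period_hours * (1.0 - uptime_pct / 100.0)


# Assumed measurement periods (illustrative, not taken from the SLA text).
HOURS_PER_YEAR = 365 * 24     # non-leap year
HOURS_PER_30_DAYS = 30 * 24

yearly = allowed_downtime_hours(99.5, HOURS_PER_YEAR)
monthly = allowed_downtime_hours(99.5, HOURS_PER_30_DAYS)
print(f"{yearly:.1f} hours per year, {monthly:.1f} hours per 30 days")
```

Under these assumptions, a 99.5% target leaves a budget of a little under two days of downtime per year, which is why measured uptime from monitoring matters more than the headline number.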

We partner with the IS&T Unix and DBA teams for monitoring and 24x7 support of the operating system and database services, which ensures that there are multiple levels of support for our production services.

But in discussions with our API customers about their service requirements, it became clear that even when our own API services are functioning perfectly, there may be issues with the source systems that our monitoring does not reflect. To improve the service that we offer to our customers, we needed to expand the monitoring to include details of the individual APIs. We are in the midst of rolling out this expanded monitoring and have engaged both our API consumers and API providers in the process.
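As a rough illustration of what a per-API check can look like, here is a sketch that maps one probe result to the standard Nagios plugin states (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name, the latency thresholds, and the decision to treat any non-2xx response as critical are assumptions made for this example, not a description of our actual checks.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_endpoint(status_code, latency_s, warn_s=1.0, crit_s=5.0):
    """Map one probe result to a (state, message) pair.

    status_code is the HTTP status of the probe, or None if no response
    was received. warn_s and crit_s are hypothetical latency thresholds.
    """
    if status_code is None:
        return CRITICAL, "CRITICAL - no response from endpoint"
    if not 200 <= status_code <= 299:
        return CRITICAL, f"CRITICAL - HTTP {status_code}"
    if latency_s >= crit_s:
        return CRITICAL, f"CRITICAL - responded in {latency_s:.2f}s"
    if latency_s >= warn_s:
        return WARNING, f"WARNING - responded in {latency_s:.2f}s"
    return OK, f"OK - HTTP {status_code} in {latency_s:.2f}s"


state, message = check_endpoint(200, 0.25)
print(state, message)
```

A check like this exercises the whole path through to the source system, so a failure upstream shows up even when the API gateway itself is healthy.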

These changes can be seen in the snapshot below of the current monitoring for API services. These API endpoints cover the vast majority of the API requests that come through our system, and we continue to expand the coverage and engage our upstream and downstream customers in the notification system so that issues can be identified and resolved rapidly and efficiently:

You might ask yourself, “How do we know that these endpoints cover the vast majority of actual usage?”

The answer is analytics: the ability to analyze the usage of your services. While it's a trendy term for a very old problem, EIS has deployed a modern analytics package and integrated it into our logging system so that we have real-time information about usage. Here is an example of a report on several weeks of activity on our API services:

Kibana Dashboard of API Activity
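The coverage question above comes down to simple arithmetic on the request logs: what share of all requests went to endpoints that have a monitoring check? The sketch below shows that calculation. The function name and the endpoint paths are hypothetical, and the input is assumed to be one request path per log entry.

```python
from collections import Counter


def monitoring_coverage(request_paths, monitored):
    """Fraction of logged requests that hit a monitored endpoint."""
    counts = Counter(request_paths)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for path, n in counts.items() if path in monitored)
    return covered / total


# Hypothetical log sample: one entry per request.
log = ["/people"] * 8 + ["/rooms"] * 1 + ["/legacy"] * 1
monitored = {"/people", "/rooms"}
print(f"{monitoring_coverage(log, monitored):.0%} of requests are monitored")
```

Sorting the same counts by volume also shows which unmonitored endpoint would add the most coverage if a check were added next.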

What you can see is that we have detailed reporting of both which APIs are being used and who is using them. In addition, the histogram shows several days with red and black activity. Red indicates errors that are typically caused by issues on the API Consumer side, while black indicates errors that are typically caused by issues on the API Producer side. Notice that those red and black areas have diminished to virtually zero during the second half of the graph.

That is because our analytics tools enabled us to identify not only overall trends, but also ongoing issues that we worked on with our customers to resolve. This analytics infrastructure provides actionable information that enables us to improve the quality of the services that we provide. We are working with our API customers to provide this information in an API form so that it can be used for notifications, forecasting, real-time reporting, and other uses as they arise.
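The consumer-side and producer-side split shown in the histogram could be approximated as follows. This sketch assumes the split follows HTTP status classes (4xx for consumer-side errors, 5xx for producer-side errors); that mapping is our assumption for illustration and the dashboard may classify errors differently. The function names and sample data are hypothetical.

```python
from collections import Counter


def error_side(status_code):
    """Classify an HTTP status as a consumer or producer error (assumed mapping)."""
    if 400 <= status_code <= 499:
        return "consumer"   # e.g. bad request or missing credentials
    if 500 <= status_code <= 599:
        return "producer"   # e.g. failure in the source system
    return None             # not an error


def tally_by_day(events):
    """Count errors per day and side from (day, status_code) pairs."""
    tallies = {}
    for day, status in events:
        side = error_side(status)
        if side is not None:
            tallies.setdefault(day, Counter())[side] += 1
    return tallies


# Hypothetical sample events.
events = [("day 1", 200), ("day 1", 401), ("day 1", 502), ("day 2", 404)]
for day, counts in tally_by_day(events).items():
    print(day, dict(counts))
```

Daily tallies like these are the kind of data that could feed notifications or trend reports once exposed through an API.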

These additional services are part of EIS's core business of making APIs more manageable and transparent for our customers.