Observability for Microcks at scale

🗓️ March 12, 2024 | 8

Observability for Microcks at scale

As part of the upcoming 1.9.0 release of Microcks, I’m super proud to have contributed new features related to its observability and performance monitoring! As supporting the Open Source ecosystem is part of my day job at Grafana Labs, I was really excited by this collaboration with the Microcks project to put into practice the use of OpenTelemetry, a project that is also part of the CNCF.

Why it matters?

Microcks can be used and deployed in many topologies: from ephemeral instances with few APIs & services to always-up-and-running instances serving complex ecosystems of APIs in large organizations. Within this wide range of use cases, Microcks can also be used in:

Short-lived instances such as on-demand sandboxes,
Quality Assurance environments,
Performance testing environments.

For deployments at scale, the project received the usual questions from the community:

How much can a single Microcks instance handle?
Can I use it for very large performance testing campaigns?
How does it scale?

The maintainers needed to be able to provide the proper insights with the right tooling to answer these questions. The first step would be to be able to accurately measure the performance of Microcks to get a grasp of what a single instance could really deliver . In addition, some large organizations running Microcks and having it as a mainstream solution started to be concerned about the upgrades. Hence, those legit questions:

Is the new 1.8.0 release lighter and better than the previous one?
Should I upgrade my MongoDB engine for better performance?
Will it bring some performance degradation?

Those questions fall into the realm of continuous improvement. Therefore, the second requirement this contribution covers is understanding where errors or performance degradation could come from to facilitate code optimizations.

Microcks already provided Prometheus endpoints for metrics but to get deeper insights, it is necessary to also collect logs and traces. Furthermore, there needed to be a way to generate load in order to help with the capacity provisioning of Microcks instances.

What’s in the box?

As part of the CNCF ecosystem, it was a natural decision with the maintainers that the way to go was to provide a comprehensive integration with the OpenTelemetry initiative. OpenTelemetry is a collection of APIs, SDKs, and tools that provide an open, vendor-agnostic way to instrument, generate, collect, transform, and export telemetry data.

In addition to the Prometheus endpoints still present, Microcks 1.9.0 can now be deployed with OpenTelemetry instrumentation. With that configuration enabled, the metrics, logs, and distributed traces produced by Microcks can be sent via OTLP (OpenTelemetry Line Protocol) to any OpenTelemetry Collector service. Enabling this configuration is very straightforward, you just have to set two environment variables during Microcks’ deployment:

OTEL_JAVAAGENT_ENABLED must be set to true, this activates the OpenTelemetry instrumentation with the OpenTelemetry Java Agent.
OTEL_EXPORTER_OTLP_EXPORT must be set to a collector URL like [http://otel-collector.acme.com:4317](http://otel-collector.acme.com:4317). By default, it uses the OTLP/GRPC protocol.

You can check the project’s OpenTelemetry documentation for more information.

Aside from the telemetry data collection, with this contribution, Microcks also provides a comprehensive Grafana dashboard. That way you get a direct digest of all the collected information with instant access to performance metrics per mock endpoints, including TPS and response time percentile information. The backends used here to store the telemetry data on which the Grafana Dashboard is built are Prometheus for the metrics, Loki for the logs, and Tempo for the traces. This enables seamless correlation of all 3 telemetry signals to analyze performance trends, discover potential issues, and identify bottlenecks.

You can check the project’s Grafana documentation for more information

Finally, as generating load on Microcks can be complex for new users, we also added a benchmarking suite to Microcks 1.9.0! Easy to go for beginners, this suite allows you to simulate Virtual Users on different usage scenarios and gather raw performance metrics of your instance. Thanks to the K6 load and performance testing suite, it’s easy to run and tune to generate a load representative of your expected usage (browsing / invoking REST mocks / invoking Graph mocks / etc..)

Note that you can use this benchmark suite without necessarily enabling the OpenTelemetry and Grafana features. You can check the project’s Benchmark documentation for more information.

Walkthrough

Want to see all of this in action? Then, go through our guided tour just below.

We will start hereafter with the Microcks Grafana dashboard displaying the metrics and the logs. You’ll see that we have used the popular RED method - (short for rate, error, and duration) - to structure this dashboard. This gives an overview of the performance & general behavior of each mock service. Users can now correlate metrics logs and traces to better understand how mocks behave. Using the timeline it is possible to narrow to problematic timeframes, focus on a small set of logs, and analyze the corresponding traces. You’ll also get the response time distribution and their percentiles.

The percentile panels show little dots that link to examples of traces that have a certain response time. This allows the user to isolate significant traces that represent a potentially problematic execution easily.

On the bottom pane of the dashboard, you get access to the latest Microcks logs. As you can see, some log lines may be enriched with a traceId by the OpenTelemetry instrumentation. If you have other services calling Microcks instrumented with OpenTelemetry, the traceId is automatically propagated and it’s then possible to jump to these trace details and get the visualization on the full end-to-end trace:

Those traces are the ideal way to diagnose slow components within your services call sequences and check that optimizations work (BTW, Microcks now provides MongoDB optimization hints within the benchmark suite 😉). From every trace, it’s also possible to isolate the log related to a traceID to see the messages that were emitted during a span or the whole trace:

To get the above data and visualizations, we ran the benchmarking suite powered by K6 which launches four different scenarios simultaneously during one minute. Here’s the raw output we got below with details on executed scenarios and raw performance metrics:

docker run --rm -i -e BASE_URL=${MICROCKS_BASE_URL} -e WAIT_TIME=0.1 grafana/k6:${K6_VERSION} run - < bench-microcks.js


          /\     |‾‾| /‾‾/   /‾‾/   
     /\  /  \    |  |/  /   /  /    
    /  \/    \   |     (   /   ‾‾\ 
   /          \  |  |\ \ |  (‾)  | 
  / __________ \ |__| \__\ \_____/ .io

  execution: local
     script: -
     output: -

  scenarios: (100.00%) 4 scenarios, 85 max VUs, 2m45s max duration (incl. graceful stop):
           * browse: 20 looping VUs for 1m0s (exec: browse, gracefulStop: 30s)
           * invokeRESTMocks: 200 iterations for each of 40 VUs (maxDuration: 2m0s, exec: invokeRESTMocks, startTime: 5s, gracefulStop: 30s)
           * invokeGraphQLMocks: 100 iterations for each of 20 VUs (maxDuration: 2m0s, exec: invokeGraphQLMocks, startTime: 10s, gracefulStop: 30s)
           * invokeSOAPMocks: 5 iterations for each of 5 VUs (maxDuration: 2m0s, exec: invokeSOAPMocks, startTime: 15s, gracefulStop: 30s)
           
[...]

running (1m04.0s), 14/85 VUs, 10271 complete and 0 interrupted iterations
browse             ↓ [ 100% ] 20 VUs  1m0s         
invokeRESTMocks    ✓ [ 100% ] 40 VUs  0m12.7s/2m0s  8000/8000 iters, 200 per VU
invokeGraphQLMocks ✓ [ 100% ] 20 VUs  0m06.9s/2m0s  2000/2000 iters, 100 per VU
invokeSOAPMocks    ✓ [ 100% ] 5 VUs   0m16.0s/2m0s  25/25 iters, 5 per VU

     ✓ status code should be 200
     ✓ pastryCall status is 200
     ✓ eclairCall status is 200
     ✓ eclairXmlCall status is 200
     ✓ eclairXmlCall response is Xml
     ✓ millefeuilleCall status is 200
     ✓ allFilmsCall status is 200
     ✓ aFilmCall status is 200
     ✓ aFilmFragmentCall status is 200
     ✓ andrewCall status is 200
     ✓ karlaCall status is 200
     ✓ karlaCall body is correct
     ✓ laurentCall status is 500
     ✓ laurentCall body is fault

     checks.........................: 100.00% ✓ 46385      ✗ 0    
     data_received..................: 132 MB  2.0 MB/s
     data_sent......................: 7.8 MB  122 kB/s
     http_req_blocked...............: avg=40.83µs  min=291ns   med=1.04µs  max=18.4ms   p(90)=5.04µs   p(95)=8.37µs  
     http_req_connecting............: avg=33.21µs  min=0s      med=0s      max=18.35ms  p(90)=0s       p(95)=0s      
     http_req_duration..............: avg=17.22ms  min=1ms     med=12.57ms max=782.36ms p(90)=28.2ms   p(95)=36.67ms 
       { expected_response:true }...: avg=17.21ms  min=1ms     med=12.57ms max=782.36ms p(90)=28.2ms   p(95)=36.66ms 
     http_req_failed................: 0.05%   ✓ 26         ✗ 48709
     http_req_receiving.............: avg=80.43µs  min=6.5µs   med=22.66µs max=29.12ms  p(90)=129.29µs p(95)=235.55µs
     http_req_sending...............: avg=15.04µs  min=1.58µs  med=4.95µs  max=9.27ms   p(90)=22.33µs  p(95)=36.83µs 
     http_req_tls_handshaking.......: avg=0s       min=0s      med=0s      max=0s       p(90)=0s       p(95)=0s      
     http_req_waiting...............: avg=17.12ms  min=902.5µs med=12.49ms max=782.34ms p(90)=28.08ms  p(95)=36.49ms 
     http_reqs......................: 48735   756.508316/s
     iteration_duration.............: avg=194.32ms min=10.5ms  med=52.16ms max=5.57s    p(90)=101.2ms  p(95)=177.01ms
     iterations.....................: 10285   159.652981/s
     vus............................: 14      min=14       max=85 
     vus_max........................: 85      min=85       max=85 


running (1m04.4s), 00/85 VUs, 10285 complete and 0 interrupted iterations
browse             ✓ [ 100% ] 20 VUs  1m0s         
invokeRESTMocks    ✓ [ 100% ] 40 VUs  0m12.7s/2m0s  8000/8000 iters, 200 per VU
invokeGraphQLMocks ✓ [ 100% ] 20 VUs  0m06.9s/2m0s  2000/2000 iters, 100 per VU
invokeSOAPMocks    ✓ [ 100% ] 5 VUs   0m16.0s/2m0s  25/25 iters, 5 per VU

And yes, we got this impressive 756.5 hits/second with a p(90) response time of 28.2ms during the bench on a Macbook M2 with a 400MB heap! 🚀

Conclusion

The Microcks user community expressed their wish to know what a single instance of Microcks is able to deliver in terms of throughput and response time. Through contribution we made Microcks ready to be fully observable, and we enabled optimization opportunities for very large-scale deployments. The project and the community users are now able to run benchmarks in autonomy to get figures and have a precise idea of what level of performance Microcks is capable of delivering.

On a personal note, OpenTelemetry is the second-largest CNCF project and it’s a big challenge to navigate in its ecosystem. This has been a good experience to find the nominal path to get the instrumentation, the storage of telemetry data, and visualization up and running for a real project. But, yes! I did it! 💪

Alain Pham

Principal Solutions Engineer at Grafana Labs