Real User Monitoring (RUM) and Point of Presence (PoP)

Background

Traditional web applications are mostly server-side rendered and have a simple network topology.

Modern web applications have a more complex network topology and heavier frontend processing:

  • A large web app usually has a complex network topology, which introduces more layers where failures can occur.
  • In most Single Page Application frameworks (e.g. EmberJS, AngularJS), the page content is mostly built in the client browser instead of on the web server.

These factors make traditional performance monitoring inadequate.

Page Load Time (PLT) breakdown of a single-page application:

waterfall-view-of-plt-breakdown.jpg

Page Load Time Definition

page-load-time.png

What is RUM?

Real User Monitoring is a type of performance monitoring that captures and analyzes each transaction by users of a website or application.

  • Real
    • It monitors the entire stack instead of just the server side. Compared to traditional server-side monitoring, it also monitors client-side and network performance metrics.
    • It is more real in the sense that it measures how fast the site is for users, not for servers.
  • All traffic
    • It monitors overall site performance using data from all users.
    • You can collect metrics from all users without impacting their experience.

RUM is now the industry-standard way to monitor overall site performance. It is also a great data source for network performance analysis, since RUM data includes the user's IP address along with various network performance metrics such as connect time, first byte time, and page download time.

How does it work?

In short, most modern browsers implement APIs for reading network timing metrics:

  1. Each page embeds a JavaScript library that collects user data and performance metrics.
  2. The JavaScript library reads the Navigation Timing API (and other APIs) exposed by the browser.
  3. It then sends the data and metrics back to a server after the page finishes loading.

The PerformanceNavigationTiming interface

The diagram below shows the timing attributes defined by the PerformanceNavigationTiming interface.

It gives an idea of what kind of data RUM collects:

performance-navigation-timing-diagram.svg

Attributes in parentheses indicate that they may not be available for navigations involving documents from different origins.
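To make the attributes concrete, here is a minimal sketch of how a RUM backend might turn one navigation timing record into the phase durations mentioned earlier (connect time, first byte time, page download time). The attribute names are the ones defined by the PerformanceNavigationTiming interface; the function name, the sample values, and the choice of startTime to loadEventEnd as the page load time are assumptions made for illustration.

```python
from typing import Dict


def timing_breakdown(entry: Dict[str, float]) -> Dict[str, float]:
    """Derive per-phase durations in milliseconds from one timing record.

    `entry` maps PerformanceNavigationTiming attribute names to timestamps
    in milliseconds relative to the start of the navigation.
    """
    return {
        "dns": entry["domainLookupEnd"] - entry["domainLookupStart"],
        "connect": entry["connectEnd"] - entry["connectStart"],
        "first_byte": entry["responseStart"] - entry["requestStart"],
        "download": entry["responseEnd"] - entry["responseStart"],
        # Assumption: page load time runs from navigation start to the end
        # of the load event.
        "page_load": entry["loadEventEnd"] - entry["startTime"],
    }


# Made-up sample record, as a beacon might report it.
sample = {
    "startTime": 0.0,
    "domainLookupStart": 5.0,
    "domainLookupEnd": 25.0,
    "connectStart": 25.0,
    "connectEnd": 95.0,
    "requestStart": 96.0,
    "responseStart": 246.0,
    "responseEnd": 406.0,
    "loadEventEnd": 1200.0,
}

breakdown = timing_breakdown(sample)
print(breakdown)
```

In the browser, the same subtraction would be done on the object returned by the Navigation Timing API before the beacon is sent, or on the server after the raw timestamps arrive.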

w3c: The PerformanceNavigationTiming interface


How LinkedIn uses the Boomerang library

Boomerang is one of the RUM libraries.

  1. The Boomerang library runs in the browser when our page is loaded and collects performance data.
  2. It does so mostly by reading the Navigation Timing API object exposed by the browser.
  3. It sends the performance data to LinkedIn's beacon servers after page load.
  4. The beacon servers then send the events to our Hadoop cluster using Kafka.

RUM Data Processing and Visualization in LinkedIn

gospeed-visualization-infra.png

What is a PoP and how does it help?

What is a PoP?

PoPs are small-scale data centers with mostly network equipment and proxy servers that act as endpoints for users' TCP connection requests. A PoP establishes and holds that connection while fetching the user-requested content from the data center.

An Internet point of presence typically houses servers, routers, network switches, multiplexers, and other network interface equipment.

Sequence diagram of established TCP connections

user-pop-dc-diagram.png

  1. DNS time: not shown in the diagram.
  2. Connection time: the user’s browser connects to the PoP (TCP + SSL handshake).
  3. First byte time start: HTTP request from the user to the PoP.
  4. HTTP request from the PoP to the data center (DC).
  5. DC early flush: usually the <head> of the HTML.
  6. First byte time end and page download time start: the user receives the first byte of the HTTP response from the PoP.
  7. The DC builds the rest of the page and sends it back to the PoP.
     • Since the page is sent on an existing TCP connection, that connection will likely have a large TCP congestion window. Thus the whole page can potentially be sent in one round-trip time (RTT).
  8. HTTP response from the PoP to the browser.
     • As the PoP receives the page, it relays the page packet by packet to the browser.
     • Since the PoP-to-browser connection is usually not a long-lived connection, its congestion window at this point is much smaller.
     • TCP’s slow start algorithm kicks in, and multiple RTTs are needed to finish serving the page to the browser.
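The difference between the two legs can be illustrated with a rough calculation. The sketch below counts round trips under an idealized slow start in which the congestion window doubles every RTT and nothing is lost. The page size, the segment size of 1460 bytes, and the window sizes are illustrative assumptions, not measurements; real connections are also limited by the receive window, packet loss, and pacing.

```python
def round_trips_to_send(page_bytes: int,
                        initial_cwnd_segments: int,
                        mss_bytes: int = 1460) -> int:
    """Count the RTTs needed to deliver `page_bytes` under idealized slow start.

    The congestion window starts at `initial_cwnd_segments` segments and
    doubles after every round trip. Loss and receive-window limits are ignored.
    """
    segments_left = -(-page_bytes // mss_bytes)  # ceiling division
    cwnd = initial_cwnd_segments
    rtts = 0
    while segments_left > 0:
        segments_left -= cwnd  # send one full window in this round trip
        cwnd *= 2              # slow start: the window doubles per RTT
        rtts += 1
    return rtts


# A fresh browser-to-PoP connection, assuming a 10-segment initial window.
cold = round_trips_to_send(100_000, initial_cwnd_segments=10)

# A long-lived PoP-to-DC connection whose window has already grown to 80 segments.
warm = round_trips_to_send(100_000, initial_cwnd_segments=80)

print(cold, warm)  # 3 1
```

Under these assumptions a 100 KB page needs three round trips on the cold connection and one on the warm connection, which is why it matters that the short, cold leg is the one with the small RTT.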

How does it help?

  • The PoP decides the optimal data center to serve a given request.
  • The PoP reuses existing connections to the data center, which means:
    • less overhead for establishing TCP connections
    • a large TCP congestion window is reached faster
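As a worked example of the connection overhead, the sketch below compares the handshake cost of connecting to a nearby PoP with that of connecting directly to a distant data center. It assumes one round trip for the TCP handshake and two for a full TLS 1.2 handshake; the RTT values are made up for illustration.

```python
def setup_cost_ms(rtt_ms: float, tcp_rtts: int = 1, tls_rtts: int = 2) -> float:
    """Time spent on handshakes before the first HTTP request can be sent.

    Assumes one RTT for the TCP handshake and two RTTs for a full
    TLS 1.2 handshake (both are parameters so they can be changed).
    """
    return rtt_ms * (tcp_rtts + tls_rtts)


# Hypothetical RTTs: 20 ms from the user to a nearby PoP,
# 150 ms from the user to a distant data center.
pop_cost = setup_cost_ms(rtt_ms=20.0)
dc_cost = setup_cost_ms(rtt_ms=150.0)

# The PoP-to-DC leg reuses an established connection, so it adds no handshake.
saved = dc_cost - pop_cost

print(pop_cost, dc_cost, saved)  # 60.0 450.0 390.0
```

With these numbers the user saves 390 ms of handshake time, because the expensive long-distance connection was set up once by the PoP and is shared across requests.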

RUM Use Cases in LinkedIn


Finding optimal PoP per geography using RUM

Problem to solve

LinkedIn uses DNS for PoP selection. LinkedIn owns multiple PoPs scattered across different regions of the world. The question is: given a request, how do we find the optimal PoP?

Existing solutions

Here are a few techniques we evaluated which did not work:

  • Geographic distance: The simplest approach is to assume that the geographically closest PoP is the optimal PoP. Unfortunately, it is well known in the networking community that geographical proximity does not guarantee network proximity.

  • Network connectivity: Our Network Engineering team could have assigned geographies to PoPs based on their understanding and knowledge of global internet connectivity. However, the Internet changes all the time; a manual approach may not keep up with the changes and incurs much higher operational costs.

  • Synthetic measurements: We could also run synthetic tests using monitoring companies such as Keynote, Gomez, and Catchpoint. These companies have a set of monitoring agents distributed across the world that can test your website. Well-known problems with this approach include:
    • The agents' geographic and network distribution may not represent our user base.
    • Agents usually have very good connectivity to the internet backbone, which may not be representative of our user base.

The RUM solution

LinkedIn extends the RUM framework to measure latency from users to all the PoPs. This is done by downloading a tiny object from each PoP after the page has loaded and measuring how long each download takes.

Some caveats:

  • Overhead: the extra downloads can be intrusive for some users.
    • To minimize the impact, data is collected for only 1% of total page views. That sample is large enough for LinkedIn to solve the problem.
  • Offline job: a daily Hadoop job is needed to aggregate this data and decide the optimal PoP per geography.
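The two caveats above can be sketched in a few lines. The first function decides whether a page view falls into the 1% sample; the second aggregates the collected download times and picks a PoP per geography. This is only an illustration under stated assumptions: hashing a page view id for sampling, using the median as the aggregate, and the PoP and geography names are all hypothetical, and the real job runs on Hadoop rather than in a single process.

```python
import hashlib
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple


def in_sample(page_view_id: str, rate: float = 0.01) -> bool:
    """Deterministically select roughly `rate` of page views by hashing the id."""
    digest = hashlib.sha256(page_view_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def optimal_pop_per_geo(
    samples: Iterable[Tuple[str, str, float]]
) -> Dict[str, str]:
    """Pick, for each geography, the PoP with the lowest median download time.

    `samples` yields (geo, pop, download_ms) tuples. Ties on the median are
    broken by PoP name so that the result is deterministic.
    """
    by_geo_pop: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for geo, pop, download_ms in samples:
        by_geo_pop[(geo, pop)].append(download_ms)

    best: Dict[str, Tuple[float, str]] = {}
    for (geo, pop), values in by_geo_pop.items():
        candidate = (median(values), pop)
        if geo not in best or candidate < best[geo]:
            best[geo] = candidate
    return {geo: pop for geo, (_, pop) in best.items()}


# Hypothetical beacon data: (geography, PoP, download time in ms).
beacons = [
    ("BR", "mia", 120.0), ("BR", "mia", 130.0), ("BR", "mia", 125.0),
    ("BR", "gru", 40.0), ("BR", "gru", 300.0), ("BR", "gru", 45.0),
    ("IN", "sin", 80.0), ("IN", "sin", 90.0),
    ("IN", "bom", 95.0), ("IN", "bom", 85.0),
]

assignment = optimal_pop_per_geo(beacons)
print(assignment)  # {'BR': 'gru', 'IN': 'sin'}
```

The median is used here so that a single slow outlier, such as the 300 ms sample for "gru", does not change the decision; a production job could equally well use a high percentile.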

RUM can also help answer these questions:

  • Did traffic really shift? How do we ensure that traffic in that geography is actually going to that PoP?
  • Did performance improve?
  • How do I pick between two PoPs? (If two or more PoPs have very similar results from the PoP Beacons data)
  • How do I pick between a PoP and a DC? A PoP has to fetch the content from a data center. If PoP Beacons data shows a PoP as the closest, but a data center as a close second, which one would really be optimal? Note that users can directly connect to the data centers as well because data centers also act as PoPs.

The answers to all these questions can be found by identifying which PoP actually served a particular page view and tying that information to other RUM metrics. For the last two questions, we could run A/B tests between the candidate PoPs to identify the optimal one.
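A minimal sketch of that idea: group page views by the PoP that actually served them, then report each PoP's share of traffic and its median page load time within a geography. The field names (geo, served_pop, plt_ms) and the sample values are hypothetical; the source does not describe the actual schema.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List


def summarize_by_served_pop(page_views: Iterable[dict]) -> Dict[str, Dict[str, dict]]:
    """Return {geo: {pop: {"traffic_share": ..., "median_plt_ms": ...}}}.

    Each page view is a dict with the hypothetical keys "geo",
    "served_pop" and "plt_ms".
    """
    grouped: Dict[str, Dict[str, List[float]]] = defaultdict(lambda: defaultdict(list))
    for pv in page_views:
        grouped[pv["geo"]][pv["served_pop"]].append(pv["plt_ms"])

    summary: Dict[str, Dict[str, dict]] = {}
    for geo, pops in grouped.items():
        total = sum(len(values) for values in pops.values())
        summary[geo] = {
            pop: {
                "traffic_share": len(values) / total,
                "median_plt_ms": median(values),
            }
            for pop, values in pops.items()
        }
    return summary


# Made-up page views after shifting the "BR" geography to the "gru" PoP.
views = [
    {"geo": "BR", "served_pop": "gru", "plt_ms": 1800.0},
    {"geo": "BR", "served_pop": "gru", "plt_ms": 2000.0},
    {"geo": "BR", "served_pop": "gru", "plt_ms": 2200.0},
    {"geo": "BR", "served_pop": "mia", "plt_ms": 3000.0},
]

summary = summarize_by_served_pop(views)
print(summary["BR"])
```

The traffic share answers whether traffic really shifted, and comparing the median page load time before and after the shift, or between the two arms of an A/B test, answers whether performance improved.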



Detecting latency leaks using RUM

Problem to solve:

Also known as site speed regression: a small latency increase is hard to detect, but such increases add up to a big performance degradation over time. How do we detect latency leaks and pinpoint their cause?

This is another common use case for RUM to solve.

You can click here for more detail about how LinkedIn solves it.
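To show why a slow leak is easy to miss day to day but visible over a longer horizon, here is one simple way to flag it: compare the median page load time of a recent window with that of a baseline window. This is a generic sketch, not LinkedIn's method; the window length, the 5% threshold, and the sample series are assumptions chosen for illustration.

```python
from statistics import median
from typing import List, Optional


def detect_latency_leak(daily_plt_ms: List[float],
                        window: int = 7,
                        threshold: float = 0.05) -> Optional[float]:
    """Return the relative increase if the recent window is slower than the
    baseline window by more than `threshold`, otherwise None.

    `daily_plt_ms` holds one aggregated page load time per day, oldest first.
    The first `window` days form the baseline, the last `window` days the
    recent period.
    """
    if len(daily_plt_ms) < 2 * window:
        return None  # not enough history to compare two windows
    baseline = median(daily_plt_ms[:window])
    recent = median(daily_plt_ms[-window:])
    increase = (recent - baseline) / baseline
    return increase if increase > threshold else None


# Made-up series: page load time creeps up by 10 ms (about 0.5%) per day.
leaking = [2000.0 + 10.0 * day for day in range(28)]
flat = [2000.0] * 28

leak = detect_latency_leak(leaking)
no_leak = detect_latency_leak(flat)
print(leak, no_leak)
```

Each daily step in the leaking series is only about 0.5%, well inside normal noise, but over four weeks the recent median is roughly 10% above the baseline and gets flagged. Pinpointing the cause would then require slicing the same comparison by page, geography, or timing phase.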

References