Preventing Transient Bugs

This page covers architectural patterns and tips for developers to follow to prevent transient bugs.

Common root causes

We've noticed a few root causes that come up frequently when addressing transient bugs.

  • Needs better state management in the backend or frontend.
  • Frontend code needs improvements.
  • Lack of test coverage.
  • Race conditions.

Frontend

Don't rely on response order

When working with multiple requests, it's easy to assume the order of the responses matches the order in which they are triggered.

That's not always the case and can cause bugs that only happen if the order is switched.

Example:

  • diffs_metadata.json (lighter)
  • diffs_batch.json (heavier)

If your feature requires data from both, ensure that the two have finished loading before working on it.

Simulate slower connections when testing manually

Add a network condition template to your browser's developer tools to enable you to toggle between a slow and a fast connection.

Example:

  • Turtle:
    • Down: 50kb/s
    • Up: 20kb/s
    • Latency: 10000ms

Collapsed elements

When setting event listeners, if not possible to use event delegation, ensure all relevant event listeners are set for expanded content.

Including when that expanded content is:

  • Invisible (display: none;). Some JavaScript requires the element to be visible to work properly, such as when taking measurements.
  • Dynamic content (AJAX/DOM manipulation).

Using assertions to detect transient bugs caused by unmet conditions

Transient bugs happen in the context of code that executes under the assumption that the application's state meets one or more conditions. We may write a feature that assumes a server-side API response always include a group of attributes or that an operation only executes when the application has successfully transitioned to a new state.

Transient bugs are difficult to debug because there isn't any mechanism that alerts the user or the developer about unsatisfied conditions. These conditions are usually not expressed explicitly in the code. A useful debugging technique for such situations is placing assertions to make any assumption explicit. They can help detect the cause which unmet condition causes the bug.

Asserting pre-conditions on state mutations

A common scenario that leads to transient bugs is when there is a polling service that should mutate state only if a user operation is completed. We can use assertions to make this pre-condition explicit:

// This action is called by a polling service. It assumes that all pre-conditions
// are satisfied by the time the action is dispatched.
export const updateMergeableStatus = ({ commit }, payload) => {
  commit(types.SET_MERGEABLE_STATUS, payload);
};

// We can make any pre-condition explicit by adding an assertion
export const updateMergeableStatus = ({ state, commit }, payload) => {
  console.assert(
    state.isResolvingDiscussion === true,
    'Resolve discussion request must be completed before updating mergeable status'
  );
  commit(types.SET_MERGEABLE_STATUS, payload);
};

Asserting API contracts

Another useful way of using assertions is to detect if the response payload returned by the server-side endpoint satisfies the API contract.

Related reading

Debug it! explores techniques to diagnose and fix non-deterministic bugs and write software that is easier to debug.

Backend

Sidekiq jobs with locks

When dealing with asynchronous work via Sidekiq, it is possible to have 2 jobs with the same arguments getting worked on at the same time. If not handled correctly, this can result in an outdated or inaccurate state.

For instance, consider a worker that updates a state of an object. Before the worker updates the state (for example, #update_state) of the object, it needs to check what the appropriate state should be (for example, #check_state).

When there are 2 jobs being worked on at the same time, it is possible that the order of operations will go like:

  1. (Worker A) Calls #check_state
  2. (Worker B) Calls #check_state
  3. (Worker B) Calls #update_state
  4. (Worker A) Calls #update_state

In this example, Worker B is meant to set the updated status. But Worker A calls #update_state a little too late.

This can be avoided by utilizing either database locks or Gitlab::ExclusiveLease. This way, jobs will be worked on one at a time. This also allows them to be marked as idempotent.

Retry mechanism handling

There are times that an object/record will be on a failed state which can be rechecked.

If an object is in a state that can be rechecked, ensure that appropriate messaging is shown to the user so they know what to do. Also, make sure that the retry functionality will be able to reset the state correctly when triggered.

Error Logging

Error logging doesn't necessarily directly prevents transient bugs but it can help to debug them.

When coding, sometimes we expect some exceptions to be raised and we rescue them.

Logging whenever we rescue an error helps in case it's causing transient bugs that a user may see. While investigating a bug report, it may require the engineer to look into logs of when it happened. Seeing an error being logged can be a signal of something that went wrong which can be handled differently.