Fixing Session Playtime Minutes Calculation: A Data Analysis and SQL Approach

Can you look at a set of numbers and tell if something's not quite right? Numbers don't lie, but they can be tricky to interpret if you don't know what to look for. Imagine being handed a table of data and asked to find a problem hidden within it, with no other information. It might sound tough, but this is exactly where a bit of analytical work with data comes in handy.

In this post, we'll explore a real example where session duration statistics weren't adding up as expected. A bug was preventing sessions from being closed after 5 minutes of inactivity, so some sessions appeared to last far too long, which skewed the results. But how do you spot the issue just by looking at the data? And once you find it, how do you fix it?

Let's walk through this process together. I'll show you how we uncovered the problem and used a SQL fix to make sure the data tells the right story.

Whether you’re into data analysis or just curious about how these things work, this example will show you how much you can discover by paying close attention to the details.

Prerequisites

measure    daily_playtime_minutes_per_user
count      1192.000000
mean       81.960668
std        147.130393
min        0.000000
25%        11.212500
50%        31.016667
75%        83.016667
max        1303.266667
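A summary table like this typically comes from an analysis tool such as pandas' describe(). If you only have SQL available, a partial equivalent can be computed directly; the sketch below is illustrative only and assumes a hypothetical daily_playtime table (one row per user per day), which is not shown in this post. Percentiles are omitted because SQLite has no built-in percentile function.

-- Illustrative sketch only: partial summary statistics in SQLite.
-- daily_playtime(user_id, day, playtime_minutes) is an assumed table.
SELECT
    COUNT(playtime_minutes) AS count,
    AVG(playtime_minutes)   AS mean,
    MIN(playtime_minutes)   AS min,
    MAX(playtime_minutes)   AS max
FROM
    daily_playtime;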

Data Examination

Let's start by examining the table above and explaining what each measure represents:

count: the number of observations in the sample (1,192).
mean: the average daily playtime, about 82 minutes per user.
std: the standard deviation, roughly 147 minutes, which is large relative to the mean.
min: the smallest recorded value, 0 minutes.
25% / 50% / 75%: the quartiles; half of all users play about 31 minutes or less per day.
max: the largest recorded value, about 1,303 minutes, which is over 21 hours of playtime in a single day.


Initial Observations

From the data, a few key issues stand out:

The maximum of roughly 1,303 minutes (almost 22 hours) of playtime in a single day is implausible and points to sessions that were never closed.
The mean (about 82 minutes) sits far above the median (about 31 minutes), so a small number of extreme values is dragging the average up.
The standard deviation (about 147 minutes) is nearly twice the mean, confirming a distribution heavily skewed by outliers.

One way to surface the offending sessions directly is sketched below.
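The query below is my own illustrative sketch, assuming only the event_data table used later in this post; the 300-minute cutoff is arbitrary, chosen just to expose extreme sessions.

-- Surface sessions whose raw span looks implausibly long.
-- The 300-minute threshold is illustrative, not from the original analysis.
SELECT
    user_id,
    session_id,
    (strftime('%s', MAX(timestamp)) - strftime('%s', MIN(timestamp))) / 60.0 AS raw_duration_minutes
FROM
    event_data
GROUP BY
    user_id,
    session_id
HAVING
    (strftime('%s', MAX(timestamp)) - strftime('%s', MIN(timestamp))) / 60.0 > 300
ORDER BY
    raw_duration_minutes DESC;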

Solving the Issue

To fix the problems we've identified, here are the steps we can take:

1. For each event, look up the timestamp of the previous event in the same session.
2. Compute the gap in seconds between consecutive events.
3. Flag any gap longer than 5 minutes of inactivity.
4. Split the original session into new sub-sessions at each flagged gap.
5. Recalculate session durations using the new session identifiers.


Expected Improvements After Applying Corrections

Once we fix the issues by splitting the outlier sessions, here's what we can expect:

The maximum daily playtime should drop to a realistic value, since runaway sessions are now cut at every 5-minute gap.
The mean should move closer to the median, as extreme values no longer inflate it.
The standard deviation should shrink, giving a tighter and more trustworthy picture of typical daily playtime.

Overall, these changes will make the data more accurate and useful for understanding how users are really interacting with the platform.

Let's look at the example and apply SQL logic to properly identify events that should have started a new session after 5 minutes of inactivity but were still grouped under the same session_id.

The SQL code below is split into separate code blocks for easier explanation of each step; in the full query, each CTE after the first is chained to the previous one with a leading comma.

Example: session_id 2b20e77e9168443e9_12

1. Prepares the data by getting the previous timestamp for each event in the sequence, partitioned by user_id and session_id. This lets us compare each event's timestamp with the one before it; for the first event in a session, LAG returns NULL.
WITH TimedEvents AS (
    SELECT
        user_id,
        session_id,
        event,
        timestamp,
        -- NULL for the first event in a session; step 3 treats that as the start of a new sub-session
        LAG(timestamp) OVER (PARTITION BY user_id, session_id ORDER BY timestamp) AS previous_timestamp
    FROM
        event_data
    WHERE
        session_id = '2b20e77e9168443e9_12'
)

2. Calculates the gap in seconds between consecutive events (the current event's timestamp minus the previous event's timestamp within the same session_id).

, SessionGaps AS (
    SELECT
        user_id,
        session_id,
        event,
        timestamp,
        previous_timestamp,
        strftime('%s', timestamp) - strftime('%s', previous_timestamp) AS gap_seconds
    FROM
        TimedEvents
)

3. Determines whether the gap exceeded 5 minutes (300 seconds), or whether the event is the first in its session. If so, session_exceeded is set to 1; otherwise 0.

, SessionExceeds AS (
    SELECT
        user_id,
        session_id,
        event,
        timestamp,
        previous_timestamp,
        -- 1 when the gap is over 5 minutes, or when this is the first event of the session
        (gap_seconds > 300) OR (previous_timestamp IS NULL) AS session_exceeded
    FROM
        SessionGaps
)

4. Calculates a running total of how many times session_exceeded was set within each session, which gives us a way to build a unique identifier for each segment. We should treat each segment as a new session.

, AdjustedSessions AS (
    SELECT
        user_id,
        session_id,
        event,
        timestamp,
        previous_timestamp,
        SUM(session_exceeded) OVER (PARTITION BY user_id, session_id ORDER BY timestamp) AS exceeded_count
    FROM
        SessionExceeds
)

What is SUM with OVER (PARTITION BY ...) Doing?

Cumulative Sum: The SUM function within the OVER clause is performing a cumulative sum (running total) of the session_exceeded values as you move down the list of events ordered by timestamp.

Partitioning: The PARTITION BY user_id, session_id ensures that this cumulative sum restarts for each new combination of user_id and session_id. This is crucial because it allows the query to track the session breaks separately for each user and session.

Tracking Session Breaks: Whenever session_exceeded is 1, it increments the cumulative sum (exceeded_count). This effectively “marks” a new sub-session starting from that point. Subsequent rows within the same original session will have an incremented exceeded_count, signifying they belong to a new logical sub-session.
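To make the running total concrete, here is a tiny self-contained example. The rows are invented purely for illustration, and PARTITION BY is omitted since only a single session is shown:

-- Toy illustration of the cumulative SUM; rows are made up for this example.
WITH demo(ts, session_exceeded) AS (
    VALUES (1, 1), (2, 0), (3, 1), (4, 0), (5, 1)
)
SELECT
    ts,
    session_exceeded,
    SUM(session_exceeded) OVER (ORDER BY ts) AS exceeded_count
FROM
    demo;
-- exceeded_count comes out as 1, 1, 2, 2, 3:
-- rows 1-2 form sub-session 1, rows 3-4 form sub-session 2, row 5 starts sub-session 3.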

5. Generates new session identifiers for the split sessions by concatenating the original session_id with the running exceeded count.

, NewSessionIDs AS (
    SELECT
        user_id,
        session_id || '-' || CAST(exceeded_count AS TEXT) AS new_session_id,
        event,
        timestamp,
        previous_timestamp
    FROM
        AdjustedSessions
)
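The walkthrough stops at the NewSessionIDs CTE. To close the chain, a final SELECT along these lines (my sketch, not from the original post, assuming the blocks above are chained together with commas as noted earlier) recomputes the duration of each new sub-session:

-- Final step (sketch): recompute durations per adjusted sub-session.
SELECT
    user_id,
    new_session_id,
    (strftime('%s', MAX(timestamp)) - strftime('%s', MIN(timestamp))) / 60.0 AS session_duration_minutes
FROM
    NewSessionIDs
GROUP BY
    user_id,
    new_session_id
ORDER BY
    MIN(timestamp);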

We can see that the session_id '2b20e77e9168443e9_12' was split into 5 separate new sessions, because five events were flagged by the session_exceeded logic. Take note of the new_session_id column in the output. If we had not applied the above SQL logic, the session_duration_minutes for this session would have been 121 minutes (the total across all the sub-sessions).


In summary, the SQL script takes event data and breaks down user sessions into smaller, more detailed sub-sessions. By identifying significant gaps in time between events, the script splits longer sessions into shorter, more relevant segments. This approach helps us better understand how users interact over time by focusing on these smaller segments.

With this method, we can see patterns in user behavior that might not have been clear in the overall session data. This clearer picture of user activity can help us make more informed decisions, whether we’re looking to improve user engagement or enhance the user experience.

As we’ve seen, paying close attention to the details in your data can reveal issues that might otherwise go unnoticed. By carefully analyzing your data and applying the right fixes, you can ensure that your results are accurate and truly reflective of the behavior that you are analyzing.

I hope this has helped you see how important it is to spot and fix data issues so you can make better, data-driven decisions!
