Partial data capture issues (web & mobile)
Incident Report for Fullstory
Postmortem

2024/11/13 Failed Tasks Not Retried / Pages Not Initialized

On October 31st, 2024, an issue introduced in our service caused a portion of the data in a small percentage (< 0.05%) of sessions to go unprocessed. This data was effectively “stuck” in our system. The issue lasted until November 13th, when it was fixed for all future sessions.

Customer Impact

Affected sessions were missing data, which resulted in gaps or disruptions when loading playback, webhooks not firing, and events not being exported via the Data Direct feature.

Affected sessions captured after Nov 6th were identified and frozen so that we could re-process the data. Starting the week of Dec 9th, we ran all the frozen data back through our pipeline and the majority of it was recovered successfully for sessions with enough data to process.

Affected sessions captured before Nov 6th will remain incomplete. Because the unprocessed data had been stuck for some time, our system misidentified it as something that needed to be cleaned up and deleted it.

We apologize for any impact this may have had on your business, and are working to better understand how to identify issues of this nature more quickly, and resolve them with even less disruption.

Root Cause

A code change altered the behavior of an uncommon failure condition, so part of our system marked the data in question as “completed” when processing had actually failed. Normally, the system automatically retries a failed operation and the data gets processed, but it did not do so during the incident period.
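
The failure mode described above can be illustrated with a minimal Python sketch. This is not Fullstory's actual code: `Status`, `RareProcessingError`, `run_task`, and `process_with_retries` are hypothetical names, and the limit of three attempts is an assumption made for the example. It shows the intended behavior, where every failure path, including a rare one, leaves the task retryable instead of recording it as completed.

```python
from enum import Enum


class Status(Enum):
    COMPLETED = "completed"
    RETRY = "retry"
    FAILED = "failed"


class RareProcessingError(Exception):
    """Hypothetical stand-in for the uncommon failure condition."""


def run_task(process, payload, attempt, max_attempts=3):
    """Run one processing attempt and return the status to record."""
    try:
        process(payload)
    except Exception:
        # Any failure, including a rare one, must stay retryable.
        # Recording COMPLETED here would leave the data unprocessed.
        return Status.RETRY if attempt < max_attempts else Status.FAILED
    return Status.COMPLETED


def process_with_retries(process, payload, max_attempts=3):
    """Retry until the task completes or the attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        status = run_task(process, payload, attempt, max_attempts)
        if status is not Status.RETRY:
            return status, attempt
    return Status.FAILED, max_attempts
```

A test for this path would typically inject a `process` callable that raises on its first calls and assert that the task is never reported as `COMPLETED` until an attempt actually succeeds.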

Because this failure condition is rare, our typical monitoring systems did not notice the issue immediately.

Resolution

After noticing some odd system behavior and finding unprocessed data that should have been processed successfully, we quickly traced the problem back to the code change and immediately deployed a fix to ensure that no future data gets into this state.

Process Changes and Prevention

Actions Taken:

  • Fixed the issue for future sessions.
  • Froze and reprocessed all affected sessions still in our system.
  • Added more automated testing around this failure case.
  • Adjusted our monitoring and alerting to better identify legitimate data that is “stuck”, even when it’s a very small percentage of overall processing.
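
As a rough illustration of the monitoring adjustment in the last item, the sketch below flags unprocessed items by age and alerts on an absolute count rather than on a share of total volume. It is a hypothetical example, not Fullstory's implementation: `WorkItem`, `find_stuck`, `should_alert`, the one-hour age threshold, and the minimum count of one are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class WorkItem:
    item_id: str
    enqueued_at: datetime
    processed: bool


def find_stuck(items, now, max_age=timedelta(hours=1)):
    """Return unprocessed items that have waited longer than max_age."""
    return [
        item for item in items
        if not item.processed and now - item.enqueued_at > max_age
    ]


def should_alert(stuck_items, min_count=1):
    """Alert on an absolute count so a tiny share of traffic still fires."""
    return len(stuck_items) >= min_count
```

Alerting on a count instead of a percentage is one way to keep a rare condition from being hidden by normal variation in overall processing volume.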

Ongoing Improvements:

  • Applying lessons learned to other parts of the system, including better detection systems and testing for any other components that process data in a similar way.
  • Improving automated observability triage so that rare-circumstance issues don’t blend in with general service metric variability.

We deeply regret this incident and invite any Fullstory customer who was materially affected to contact support@fullstory.com. We stand ready to fully address all of your concerns.

Posted Dec 10, 2024 - 16:32 EST

Resolved
This issue is now resolved. Most accounts will notice a small percentage of sessions affected by this issue. Affected sessions may be missing entirely or missing individual pages, which also impacts Search, Conversions, Metrics, Funnels, and Dashboards.

We’re actively remediating the sessions from November 6th - 13th. After remediation, sessions and analytics will be fully recovered. Data sent to a data warehouse or via a webhook for this time period will be delayed.

If you watch a session from October 31st - November 5th and playback skips, it’s likely caused by this issue. If you would like to confirm that your account was impacted, please reach out to support@fullstory.com.
Posted Nov 14, 2024 - 17:12 EST
Monitoring
Our team has identified an issue impacting data capture for web and mobile sessions that started October 31st, 2024 at 4:03pm ET. This issue impacted 4,770 Fullstory accounts, but only about 0.5% of total captured sessions per account.

Due to our internal retention timeframe, sessions impacted by this issue prior to November 6, 2024 at 5:28pm ET are nonrecoverable. We increased our retention to ensure no further impacted session data would be lost. A fix for the issue has been deployed and we're monitoring this further.
Posted Nov 13, 2024 - 19:11 EST
This incident affected: Data Capture (Web Capture, Native Mobile Capture) and Fullstory Web Application.