Web source retrieval
Incident Report for Ouriginal
Postmortem

This was an isolated EU incident, so any reports generated using our US endpoint (us.ouriginal.com/api) are NOT affected.

 

Course of events

We halted processing of documents at around 23:00 CEST on Friday, because we saw that the quality of many of the reports produced was poor. The cause was the service that fetches external sources: it had been unstable on several servers, and the problem had gone unnoticed for a couple of days. After the problem was resolved at 09:00 CEST on Saturday, we restarted processing, and by about 17:00 CEST on Saturday we had worked through the queue that had built up since Friday.

The quality problem affected many, but not all, reports: report generation is distributed over several servers, and on some of those servers the service was still working correctly. Because only some reports were affected, the incident flew under the radar for a period of time.

All reports generated since Wednesday 00:01 CEST for documents where we saw lower quality than expected were enqueued again. We worked through that queue by approximately 04:00 CET today (Monday morning).

Yesterday evening (Sunday), we started pushing the new results back to Canvas, since Canvas keeps the old result until we have pushed the new one. As a parallel effort, we have also implemented a message in the Report view: if a user accesses an old report, the message indicates that a newer report is available and links to it. If there is no message in the report, the user is already looking at the new report.

The root cause of the incident was a stability issue in the service that fetches external sources, which caused the service to crash under high load in specific circumstances.

 

How do we prevent this from happening again?

Following our immediate actions to mitigate the effects of the incident, we are addressing the stability issue in the service that fetches external sources. We are also implementing failover, so that if the source-fetching service goes down on one server, its fetch jobs are automatically redistributed to a new instance.
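As a rough illustration of the failover idea, here is a minimal sketch in Python, assuming a simple pool of fetcher instances. All names in it (INSTANCES, fetch_sources, FetchError) are hypothetical and simplified for illustration; they do not describe our production implementation.

    import random

    # Hypothetical pool of source-fetching instances (illustrative names only).
    INSTANCES = ["fetcher-eu-1", "fetcher-eu-2", "fetcher-eu-3"]

    class FetchError(Exception):
        """Raised when a source-fetching instance fails or crashes mid-job."""

    def fetch_sources(instance, document_id):
        # Placeholder for the real call to the fetching service on `instance`;
        # here we simulate an unstable instance so the failover can be observed.
        if random.random() < 0.5:
            raise FetchError(instance + " crashed while fetching sources")
        return ["https://example.org/source-for-" + document_id]

    def fetch_with_failover(document_id):
        # Try the instances in random order; if one is down, the job is
        # automatically handed to the next instance instead of being lost.
        last_error = None
        for instance in random.sample(INSTANCES, len(INSTANCES)):
            try:
                return fetch_sources(instance, document_id)
            except FetchError as err:
                last_error = err  # this instance is down: fail over to the next
        raise FetchError("all instances failed for " + document_id) from last_error

The key point of this design is that a single crashed instance no longer stalls a report; the job only fails if every instance in the pool is unavailable.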

 

How does this affect your organisation?

There are no immediate actions you need to take, since we are taking measures to minimize the inconvenience for all customers. You may, however, want to inform your users that any report produced between Wednesday and Friday that has already been reviewed should be reopened to check whether a new report is available. A message with a link to the new report will appear when an old report is opened. If no such message is visible, the report being viewed is either the latest version or was not affected by this incident. The message in the report went live today (Monday) at 10:00 CET.

 

We apologize for the inconvenience this may have caused you.

Yours sincerely,

Team Ouriginal

Posted Nov 01, 2021 - 12:00 UTC

Resolved
This incident has been resolved.

Processing of documents has been running since 09:00 CEST on Saturday.

The queue of re-analyzed documents has been processed, and a message in the Report view, indicating that a newer report is available when an old report is accessed, is now live.

A postmortem will follow.
Posted Nov 01, 2021 - 11:39 UTC
Update
We are continuing to create new reports for the ones with insufficient results. From tomorrow (Monday) at 10:00 CET, a message with a link to the newer report will appear in the Report view whenever a new report is available.
Posted Oct 31, 2021 - 20:02 UTC
Update
The queue has been handled; we are creating new reports for the ones with insufficient results.
Posted Oct 31, 2021 - 09:00 UTC
Update
The queue has been handled; we will start to create new reports for the ones with insufficient results.
Posted Oct 30, 2021 - 15:51 UTC
Update
We are continuing to monitor and are processing the queue, but expect delays.
Posted Oct 30, 2021 - 14:04 UTC
Update
We are continuing to monitor and are processing the queue, but expect delays.
Next update in 2 h.
Posted Oct 30, 2021 - 12:08 UTC
Monitoring
A fix for the problem has been implemented; we are processing the queue but expect delays.
We are sorry for the inconvenience caused.
Next update in 2 h.
Posted Oct 30, 2021 - 10:03 UTC
Update
We are still working on a fix for the problem with external source retrieval.
We are sorry for the inconvenience caused.
Next update in 2 h.
Posted Oct 30, 2021 - 07:56 UTC
Update
We are still working on a fix for the problem with external source retrieval.
Next update in 2 h.
Posted Oct 30, 2021 - 05:51 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Oct 29, 2021 - 22:32 UTC
Investigating
We are currently experiencing issues with our external source retrieval and are investigating to find the root cause. During this time, some reports have been produced with insufficient results; we will investigate and generate new reports for those as soon as the issue has been resolved. During the investigation we have stopped processing, which causes delays in report generation. We will post updates as soon as possible.
Posted Oct 29, 2021 - 21:45 UTC
This incident affected: Report Processing.