This was an isolated EU incident, so any reports generated using our US endpoint (us.ouriginal.com/api) are NOT affected.
Course of events
We halted document processing at around 23:00 CEST on Friday after we saw that the quality of many of the reports produced was poor. The cause was the service that fetches external sources, which had been unstable on several servers and had gone unnoticed for a couple of days. After the problem was resolved on Saturday at 09:00 CEST, we restarted processing and, by about 17:00 CEST on Saturday, had worked through the queue that had built up since Friday.
The quality problem affected many but not all reports, since report generation is distributed over several servers and the service was still working correctly on some of them. Because not all reports were affected, the incident went unnoticed for a period.
All reports generated from Wednesday 00:01 CEST onwards for documents where we saw lower quality than expected were enqueued again. We had worked through that queue by approximately 04:00 CET today (Monday morning).
Yesterday evening (Sunday), we started pushing the new results back to Canvas, since Canvas keeps the old result until we have pushed the new one. In parallel, we have also implemented a message in the Report view that appears when an old report is accessed, indicating that a new report is available and linking to it. If there is no message in the report, the user is already looking at the new report.
The root cause of the incident was a stability issue in the service fetching external sources, causing the service to crash when exposed to high loads under specific circumstances.
How do we prevent this from happening again?
Following our immediate actions to mitigate the effects of the incident, we are also addressing the stability issue of the service fetching external sources. We are also implementing a failover so that if the service fetching external sources goes down, the job fetching the sources will automatically be distributed to a new instance.
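For technically interested readers, the failover idea described above can be sketched roughly as follows. This is a minimal, illustrative Python sketch and not our production implementation: the names `FetcherInstance`, `FetchError`, `fetch_with_failover`, and the example URL are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


class FetchError(Exception):
    """Raised when a fetcher instance fails to retrieve a source (hypothetical)."""


@dataclass
class FetcherInstance:
    # Hypothetical stand-in for one instance of the source-fetching service.
    name: str
    fetch: Callable[[str], str]  # takes a source URL, returns its content
    healthy: bool = True


def fetch_with_failover(url: str, instances: List[FetcherInstance]) -> str:
    """Hand the job to the first healthy instance; on failure, move to the next."""
    for instance in instances:
        if not instance.healthy:
            continue
        try:
            return instance.fetch(url)
        except FetchError:
            # Take the failed instance out of rotation and try another one.
            instance.healthy = False
    raise FetchError(f"no healthy instance could fetch {url}")


def crashing_fetch(url: str) -> str:
    # Simulates an instance that has gone down.
    raise FetchError("instance crashed")


def working_fetch(url: str) -> str:
    # Simulates an instance that is working correctly.
    return f"content of {url}"


instances = [
    FetcherInstance("fetcher-1", crashing_fetch),
    FetcherInstance("fetcher-2", working_fetch),
]
result = fetch_with_failover("https://example.com/source", instances)
print(result)
```

In this sketch the job is not lost when the first instance fails. It is passed on to the next healthy instance, and the failed instance is marked unhealthy so that later jobs skip it.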
How does this affect your organisation?
There are no immediate actions you need to take, since we are taking measures to minimize the inconvenience for all customers. You may, however, want to pass this message on to your users and let them know that any report produced between Wednesday and Friday that has already been reviewed should be reopened to see if a new report is available. When an old report is opened, a message with a link to the new report will appear. If no such message is visible, either the report being viewed is already the latest version or the report was not affected by this incident. The message in the report went live today (Monday) at 10:00 CET.
We apologize for the inconvenience this may have caused you.