Another Microsoft Exchange failure

There’s been yet another failure of Microsoft’s Exchange Online service (aka Microsoft 365 hosted email) affecting customers around the world by stopping or delaying outgoing messages. Microsoft hid the extent of the problem by spreading the details across two separate bug reports.

Most people first noticed the problem when a message they sent came back with the error “503 5.5.1 Bad sequence of commands” or an NDR (Non-Delivery Report).

As usual, Microsoft was coy about the extent of the problem, saying only that “some” users “may” be affected. However, reports were coming in from around the globe, so the bug wasn’t limited to a single server farm or region.

According to Microsoft, the problem started before 5am UTC on Tuesday 18th July and was reported fixed about 3.5 hours later (8:25am UTC). Microsoft’s explanation is in ticket EX649175.

BUT that turned out not to be the whole story, because a second incident report (EX648815) appeared later: “Users’ inbound and outbound email delivery may be delayed for 15 minutes or longer in Exchange Online”. These mail delays went on for OVER a day (17 July 3pm to 18 July 7:30pm).

What happened?

BOTH these problems happened because of the same fault in the Exchange Online system.

It turns out the root cause was a change to the free/busy system, which flooded Exchange Online with control messages. Microsoft’s attempt to block that flood is what then disrupted the email service.

“… a recent change caused a large influx of free/busy Control Flow Messages (CFM) to be sent to the Exchange online infrastructure, resulting in stale free-busy information.
An interceptor rule was applied to discard these CFM messages to mitigate impact for the free/busy issue, which inadvertently caused impact to mail flow.”

Source: Microsoft

What to do

If you tried to send a message and received one of the errors mentioned above, you’ll have to resend it yourself. Curiously, there’s apparently no way for Microsoft to identify the affected messages and resend them automatically.
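
If you have a lot of bounced mail to check, a small script can help pick out the messages hit by this particular error. The following is only a rough sketch in Python. It assumes you have already collected the original subject line and the bounce text of each message by hand; the function names and the input format are hypothetical and are not part of any Microsoft tool.

```python
import re

# Matches the SMTP reply code quoted in the error above:
# "503 5.5.1 Bad sequence of commands".
ERROR_PATTERN = re.compile(r"\b503\s+5\.5\.1\b")


def needs_resend(bounce_text: str) -> bool:
    """Return True if the bounce text mentions the 503 5.5.1 error."""
    return bool(ERROR_PATTERN.search(bounce_text))


def subjects_to_resend(bounces: list[tuple[str, str]]) -> list[str]:
    """Pick the subjects of messages whose bounce shows the error.

    `bounces` is a list of (original subject, bounce text) pairs.
    This input format is an assumption made for illustration.
    """
    return [subject for subject, text in bounces if needs_resend(text)]


if __name__ == "__main__":
    # Made-up sample data, not real bounce messages.
    sample = [
        ("Quarterly report", "Remote server returned '503 5.5.1 Bad sequence of commands'"),
        ("Lunch on Friday", "Remote server returned '550 5.1.1 User unknown'"),
    ]
    for subject in subjects_to_resend(sample):
        print("Resend:", subject)
```

The script only flags candidates. It can't tell whether a message was merely delayed and delivered later, so check with the recipient before resending anything important.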