User:Jorgk/8-bit bytes and e-mail corruption at Verizon, Yahoo, etc.

From MozillaWiki

All you ever need to know about 8bit bytes and corrupted e-mail passing through the servers of Verizon, Yahoo, Bellsouth, AT&T, Sbcglobal and others

A brief excursion into the history of encodings

According to this Wikipedia article, the 8bit byte was introduced as a basic unit of addressable digital information back in the 1960s. In those days, both the ASCII and EBCDIC encodings were invented. ASCII is a 7bit encoding that uses only the lower seven bits of the 8bit byte, whereas EBCDIC is an 8bit encoding using all eight bits. When computers became more popular and were used internationally, other "local" 8bit encodings were invented, for example the ISO-8859 character sets: ISO-8859-1 for "Western" Latin-based languages, containing German, French and Spanish characters (äöü, áéíóú, etc.), ISO-8859-7 for Greek, and many more. Microsoft later enhanced those encodings by adding, for example, the Euro sign (€), so ISO-8859-1 was replaced by windows-1252 and ISO-8859-7 by windows-1253. All these one-byte 8bit encodings still didn't cater for languages using a larger set of characters, like Chinese, Japanese and Korean (usually summarised as CJK), so other encodings were created for these. In the end, Unicode and the UTF-8 encoding gained prevalence. UTF-8 is an 8bit encoding where each character is encoded using one or more bytes. If one byte is used, it represents a 7bit ASCII character; if more bytes are used, they all have the highest bit set. In other words, a single, isolated byte with the highest bit set is not valid in UTF-8.

This is all you need to know to understand the following. Note that in the following text a highbit byte is an 8bit byte with the highest bit set, so >= 80 in hexadecimal. An 8bit character is one that can't be encoded in plain 7bit ASCII but needs to be encoded using highbit bytes.
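The definitions above can be checked with a few lines of Python. This is only an illustrative sketch: the helper names has_highbit and is_valid_utf8 are made up for this example, and Python's cp1252 codec stands in for windows-1252.

```python
def has_highbit(data: bytes) -> bool:
    """Return True if any byte has the highest bit set (>= 0x80)."""
    return any(b >= 0x80 for b in data)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


word = "Hägar"
as_cp1252 = word.encode("cp1252")  # one byte for "ä": E4
as_utf8 = word.encode("utf-8")     # two bytes for "ä": C3 A4

# Both encodings contain highbit bytes, but only one is valid UTF-8.
print(as_cp1252.hex(" "), has_highbit(as_cp1252), is_valid_utf8(as_cp1252))
print(as_utf8.hex(" "), has_highbit(as_utf8), is_valid_utf8(as_utf8))
```

The lone E4 byte of the windows-1252 version fails UTF-8 validation, which is exactly the property that matters for the corruption described later.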

Email transmission

It's important to know that in the early days, some email servers could only transmit the lower seven bits (Content Transfer Encoding 7bit), but these days most email servers can handle highbit bytes (CTE 8bit). When sending an email, the SMTP server is queried whether it has the so-called 8BITMIME capability. Thunderbird ignores the result, however, since almost all modern servers support 8BITMIME, some without advertising it. As the next section shows, Yahoo does advertise 8BITMIME.

8BITMIME and Yahoo

Yahoo servers advertise 8BITMIME support (at time of writing, 30 July 2019). To verify this, run telnet smtp.mail.yahoo.com 25 followed by EHLO Yahoo. The result is:

250-smtp408.mail.ir2.yahoo.com Hello Yahoo [95.23.45.31])
250-PIPELINING
250-ENHANCEDSTATUSCODES
250-8BITMIME
250-SIZE 41697280
250 STARTTLS
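The same check can be done programmatically. The following sketch parses an EHLO reply such as the one above and collects the advertised extension keywords; the function name ehlo_extensions is hypothetical, and the reply text is simply the transcript copied into a string, so no network access is involved.

```python
def ehlo_extensions(response: str) -> set[str]:
    """Collect the extension keywords from a multi-line EHLO reply."""
    keywords = set()
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    # The first line is the server greeting; extensions follow it.
    for line in lines[1:]:
        if not line.startswith("250"):
            continue
        # Drop the "250-" or "250 " prefix and keep the first word.
        rest = line[4:].split()
        if rest:
            keywords.add(rest[0].upper())
    return keywords


REPLY = """\
250-smtp408.mail.ir2.yahoo.com Hello Yahoo [95.23.45.31])
250-PIPELINING
250-ENHANCEDSTATUSCODES
250-8BITMIME
250-SIZE 41697280
250 STARTTLS
"""

print("8BITMIME" in ehlo_extensions(REPLY))
```

Against a live server, Python's smtplib offers SMTP.ehlo() and SMTP.has_extn("8bitmime") for the same purpose.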

Why does my e-mail contain 8bit characters if I'm only writing English text

Many users think that they shouldn't be affected by 8bit woes since they are only writing English text. That is not the case. As soon as they type two consecutive spaces into an email, they create an 8bit character, since the first space is stored as a so-called non-breaking space, encoded as the single byte A0 in windows-1252 or as the two bytes C2 A0 in UTF-8. Also, English writers sometimes use smart quotes or "long" dashes, which cannot be encoded as 7bit ASCII characters, so 8bit characters are used.
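To see which bytes these characters produce, the sketch below encodes a few of them in both encodings. The selection of characters and the helper name hex_bytes are chosen for illustration only; cp1252 is Python's name for windows-1252.

```python
# Characters that commonly sneak into otherwise plain English text.
SAMPLES = {
    "non-breaking space": "\u00a0",
    "left smart quote": "\u201c",
    "right smart quote": "\u201d",
    "en dash": "\u2013",
    "em dash": "\u2014",
}


def hex_bytes(ch: str, encoding: str) -> str:
    """Return the encoded bytes of ch as upper-case hex, space separated."""
    return ch.encode(encoding).hex(" ").upper()


for name, ch in SAMPLES.items():
    # None of these characters is 7bit ASCII.
    print(f"{name:20} ascii={ch.isascii()!s:6}"
          f" windows-1252={hex_bytes(ch, 'cp1252'):3}"
          f" UTF-8={hex_bytes(ch, 'utf-8')}")
```

Every one of these characters needs at least one highbit byte in either encoding.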

What is the problem when sending email through a Yahoo SMTP server

Mail servers run by Verizon, Yahoo, Bellsouth, AT&T, Sbcglobal and others have been known to be defective since February 2018. They alternate between two defective states: at times they destroy all 8bit characters, at other times they only destroy 8bit characters of non-UTF-8 encodings (more below). The author of this article has sent countless test mails since February 2018. 8bit support at Yahoo & Co. never worked; one of the two following corruptions was always observed.

State 1: All 8bit characters are destroyed

In this case, all highbit bytes in all encodings are replaced with a question mark ?. Example: A user sends Hi Bob,  raining today. (with two spaces after the comma). Bob receives Hi Bob,? raining today. or Hi Bob,?? raining today., depending on whether the e-mail was sent encoded in windows-1252, so the first space was transmitted as A0, or in UTF-8, where the first space was transmitted as C2 A0. Needless to say, any other 8bit characters, like äöü or áéíóú, and any Greek or CJK text are completely obliterated, since highbit bytes are used regardless of the encoding. For example, Hägar becomes H?gar or H??gar, again depending on the encoding used: one byte in windows-1252 or two in UTF-8.
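The effect of state 1 can be reproduced with a short simulation. This is a sketch of the observed behaviour, not of whatever the servers actually run; the function name destroy_highbit is made up for this example.

```python
def destroy_highbit(data: bytes) -> bytes:
    """Replace every byte >= 0x80 with an ASCII question mark."""
    return bytes(b if b < 0x80 else ord("?") for b in data)


# Two consecutive spaces: the first one is a non-breaking space.
text = "Hi Bob,\u00a0 raining today."

for encoding in ("cp1252", "utf-8"):
    received = destroy_highbit(text.encode(encoding))
    print(encoding, received.decode("ascii"))

for encoding in ("cp1252", "utf-8"):
    received = destroy_highbit("Hägar".encode(encoding))
    print(encoding, received.decode("ascii"))
```

The number of question marks equals the number of highbit bytes, which is why the UTF-8 versions show two where the windows-1252 versions show one.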

State 2: All highbit bytes are interpreted as UTF-8, other encodings destroyed

This is the more puzzling corruption. When Yahoo servers behave like this, all messages encoded in UTF-8 pass through the servers unharmed. However, messages using other encodings, like windows-1252, are corrupted in a fiendish way. In general, mail servers should not interpret the message content but pass it through unchanged; Yahoo violates this and modifies email content. When in this state, Yahoo interprets all highbit bytes in a message as UTF-8. If a message contains a highbit byte which is not valid in the UTF-8 encoding, it is replaced with the so-called replacement character � (U+FFFD), encoded in UTF-8 as the three bytes EF BF BD. We need to remember that the single highbit bytes used in the ISO-8859 or windows-12* encodings are all invalid if interpreted as UTF-8, since in UTF-8 the first highbit byte must be followed by at least one more highbit byte. So the windows-1252 non-breaking space A0 and the windows-1252 encoded letter "ä", E4 as a byte, are both replaced by EF BF BD. But it gets worse. The receiving email client honours the declared encoding of the message and, for example, displays it as windows-1252. And in windows-1252, the byte sequence EF BF BD is displayed as the string ï¿½. Yes, the letter "i" with diaeresis, a Spanish opening ¿ and the fraction ½. Example: A user sends Hi Bob,  raining today. (with two spaces after the comma). Bob receives Hi Bob,ï¿½ raining today. Needless to say, messages using 8bit ISO-8859 or windows-12* encodings, like Greek, will be totally obliterated.
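The chain of events in state 2 can be simulated as follows. The sketch assumes that the server behaves like a lenient UTF-8 decode followed by a re-encode, which matches the observed result but is only a guess at the cause; the function name reinterpret_as_utf8 is hypothetical.

```python
def reinterpret_as_utf8(data: bytes) -> bytes:
    """Treat the bytes as UTF-8; invalid bytes become U+FFFD (EF BF BD)."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


text = "Hi Bob,\u00a0 raining today."

# A windows-1252 message: the lone A0 byte is not valid UTF-8.
sent = text.encode("cp1252")
relayed = reinterpret_as_utf8(sent)
print(relayed.hex(" "))

# The receiving client still believes the message is windows-1252.
shown = relayed.decode("cp1252")
print(shown)

# A UTF-8 message passes through unchanged.
utf8_sent = text.encode("utf-8")
print(reinterpret_as_utf8(utf8_sent) == utf8_sent)
```

Decoding the relayed bytes as windows-1252 turns each EF BF BD into the three visible characters ï, ¿ and ½.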

What can be done at Yahoo

In reality, this horrible defect at Yahoo & friends should have caused a storm of complaints. The author of this article complained on Yahoo forums twice before those help forums were shut down. The Thunderbird community manager has approached who he thought was a responsible manager at Verizon various times. There has been no change in Yahoo's behaviour other than switching from state 1 to state 2 or vice versa. The email sent to Yahoo in June 2019 can be found here. It only describes state 2 since that was the behaviour at time of writing.

What can be done in Thunderbird

Thunderbird can be instructed to send all messages in pure 7bit ASCII using the so-called quoted-printable (QP) encoding, where each highbit byte is transmitted as its hexadecimal value preceded by an "=" sign. So the letter "ä", or E4 in windows-1252, is transmitted as the string =E4. Obviously this is not very efficient, since every highbit byte takes up three bytes instead of one when transmitted and stored on mail servers, so text consisting mostly of 8bit characters roughly triples in size. That's also why QP encoding is not the default in Thunderbird. To switch on QP encoding, set the preference mail.strictly_mime to true in the configuration editor: Tools > Options, Advanced, General tab, Config Editor, then paste mail.strictly_mime into the search box. Note that on Mac and Linux the options are under Edit > Preferences.
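What QP encoding does to the bytes can be tried out with Python's standard quopri module. This only illustrates the encoding itself, not Thunderbird's implementation; the helper name to_qp is made up for this example.

```python
import quopri


def to_qp(text: str, charset: str) -> str:
    """Encode text in the given charset, then as quoted-printable."""
    return quopri.encodestring(text.encode(charset)).decode("ascii")


for charset in ("cp1252", "utf-8"):
    encoded = to_qp("Hägar", charset)
    # The result is pure 7bit ASCII, so no highbit byte can be destroyed.
    print(charset, encoded, encoded.isascii())

# Decoding restores the original bytes exactly.
restored = quopri.decodestring(to_qp("Hägar", "cp1252"))
print(restored.decode("cp1252"))
```

Each highbit byte becomes three ASCII characters, so "ä" costs three bytes in windows-1252 and six in UTF-8 once QP encoded.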