User:Jorgk/8-bit bytes and e-mail corruption at Verizon, Yahoo, etc.
- 1 All you ever need to know about 8bit bytes and corrupted e-mail passing through the servers of Verizon, Yahoo, Bellsouth, AT&T, Sbcglobal and others
- 1.1 A brief excursion into the history of encodings
- 1.2 Email transmission
- 1.3 8BITMIME and Yahoo
- 1.4 Why does my e-mail contain 8bit characters if I'm only writing English text
- 1.5 What is the problem when sending email through a Yahoo SMTP server
- 1.6 What can be done at Yahoo
- 1.7 What can be done in Thunderbird
All you ever need to know about 8bit bytes and corrupted e-mail passing through the servers of Verizon, Yahoo, Bellsouth, AT&T, Sbcglobal and others
A brief excursion into the history of encodings
According to this Wikipedia article, the 8bit byte was introduced as a basic unit of addressable digital information back in the 1960s. Back in those days, both the ASCII and EBCDIC encodings were invented. ASCII is a 7bit encoding using only the lower seven bits of the 8bit byte, whereas EBCDIC is an 8bit encoding using all eight bits. When computers became more popular and were used internationally, other "local" 8bit encodings were invented, for example the ISO-8859 character sets: ISO-8859-1 for "Western" Latin-based languages, containing German, French and Spanish characters (äöü, áéíóú, etc.), ISO-8859-7 for Greek, and many more. Microsoft later enhanced those encodings by adding, for example, the Euro sign (€), so ISO-8859-1 was replaced by windows-1252 and ISO-8859-7 by windows-1253. All these one-byte 8bit encodings still didn't cater for languages using a larger set of characters, like Chinese, Japanese and Korean (usually summarised as CJK), so other encodings were created for these. In the end, Unicode and the UTF-8 encoding gained prevalence. UTF-8 is an 8bit encoding where each character is encoded using one or more bytes. If one byte is used, the byte represents a 7bit ASCII character; if more bytes are used, they all have the highest bit set. In other words, a single byte with the highest bit set is not valid in UTF-8.
This is all you need to know to understand the following. Note that in the following text a highbit byte is an 8bit byte with the highest bit set, that is, 80 or above in hexadecimal. An 8bit character is one that can't be encoded in plain 7bit ASCII but needs to be encoded using highbit bytes.
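The byte values mentioned above can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `hex_bytes`, `has_highbit` and `is_valid_utf8` are made up for this example, and `cp1252` is Python's codec name for windows-1252.

```python
def hex_bytes(text, encoding):
    """Return the encoded bytes of text as upper-case hex, e.g. 'C3 A4'."""
    return text.encode(encoding).hex(" ").upper()  # hex(sep) needs Python 3.8+


def has_highbit(data):
    """True if any byte has the highest bit set (0x80 or above)."""
    return any(b >= 0x80 for b in data)


def is_valid_utf8(data):
    """True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for ch in ("a", "ä", "€"):
        print(repr(ch),
              "windows-1252:", hex_bytes(ch, "cp1252"),
              "UTF-8:", hex_bytes(ch, "utf-8"))
    # A lone highbit byte, such as windows-1252 "ä" (E4), is not valid UTF-8.
    print(is_valid_utf8(b"\xe4"))      # False
    print(is_valid_utf8(b"\xc3\xa4"))  # True
```

Plain ASCII characters produce the same single byte in both encodings, while every 8bit character produces one highbit byte in windows-1252 and two or more in UTF-8.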
Email transmission
It's important to know that in the early days, some email servers could only transmit the lower seven bits (Content-Transfer-Encoding 7bit), but these days most email servers can handle highbit bytes (CTE 8bit). When sending an email, the client can query the SMTP server for the so-called 8BITMIME capability. Thunderbird, however, ignores the result, since almost all modern servers support 8BITMIME, some without advertising it. As the next section shows, Yahoo does advertise 8BITMIME.
8BITMIME and Yahoo
Yahoo servers support 8BITMIME (at time of writing 30 July 2019). To prove this run
telnet smtp.mail.yahoo.com 25
followed by
EHLO Yahoo
The result is:
250-smtp408.mail.ir2.yahoo.com Hello Yahoo [220.127.116.11])
250-PIPELINING
250-ENHANCEDSTATUSCODES
250-8BITMIME
250-SIZE 41697280
250 STARTTLS
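The capability check can also be scripted. The following is a minimal sketch that parses an EHLO reply like the one above and looks for 8BITMIME. The function name `parse_ehlo` is hypothetical, and the sample reply is copied from the session shown here rather than fetched live. For a live check, Python's standard `smtplib.SMTP` class offers `ehlo()` and `has_extn("8bitmime")`.

```python
SAMPLE_REPLY = """\
250-smtp408.mail.ir2.yahoo.com Hello Yahoo [220.127.116.11])
250-PIPELINING
250-ENHANCEDSTATUSCODES
250-8BITMIME
250-SIZE 41697280
250 STARTTLS
"""


def parse_ehlo(reply):
    """Return the set of extension keywords advertised in an EHLO reply.

    The first line is the server greeting; every following line is
    '250-KEYWORD [params]' or, for the last line, '250 KEYWORD [params]'.
    """
    keywords = set()
    lines = reply.strip().splitlines()
    for line in lines[1:]:
        text = line[4:].strip()  # drop the '250-' or '250 ' prefix
        if text:
            keywords.add(text.split()[0].upper())
    return keywords


if __name__ == "__main__":
    extensions = parse_ehlo(SAMPLE_REPLY)
    print("8BITMIME advertised:", "8BITMIME" in extensions)
```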
Why does my e-mail contain 8bit characters if I'm only writing English text
Many users think that they shouldn't be affected by 8bit woes since they are only writing English text. That is not the case. As soon as they type two consecutive spaces into an email, they create an 8bit character, since the first space is stored as a so-called non-breaking space, encoded as one byte A0 in windows-1252 or two bytes C2 A0 in UTF-8. Also, English writers sometimes use smart quotes or "long" dashes, which have no 7bit ASCII equivalent, so 8bit characters are used.
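To see which 8bit characters hide in a seemingly plain English text, something like the following sketch can be used. The function name `find_8bit_chars` and the sample text are invented for illustration; the character names come from Python's `unicodedata` module.

```python
import unicodedata


def find_8bit_chars(text):
    """List (position, character, Unicode name) for every non-ASCII character."""
    return [
        (pos, ch, unicodedata.name(ch, "UNKNOWN"))
        for pos, ch in enumerate(text)
        if ord(ch) > 0x7F
    ]


if __name__ == "__main__":
    # Two consecutive spaces (the first one a non-breaking space),
    # an en dash and a pair of smart quotes.
    body = "Hi Bob,\u00a0 raining today \u2013 bring \u201can umbrella\u201d."
    for pos, ch, name in find_8bit_chars(body):
        print(pos, repr(ch), name)
```

A text for which this returns an empty list can be sent as 7bit ASCII without any further encoding.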
What is the problem when sending email through a Yahoo SMTP server
Mail servers run by Verizon, Yahoo, Bellsouth, AT&T, Sbcglobal and others have been known to be defective since February 2018. They alternate between two defective states: at times they destroy all 8bit characters, at other times they only destroy 8bit characters of non-UTF-8 encodings; more below. The author of this article has sent countless test mails since February 2018. 8bit support at Yahoo & Co. has never worked; one of the two following corruptions was always observed.
State 1: All 8bit characters are destroyed
In this case, all highbit bytes in all encodings are replaced with a question mark ?. Example: A user sends
Hi Bob,  raining today.
Bob receives:
Hi Bob,? raining today.
or
Hi Bob,?? raining today.
depending on whether the e-mail was sent encoded in windows-1252, so the first space was transmitted as A0, or in UTF-8, where the first space was transmitted as C2 A0. Needless to say, any other 8bit characters, like äöü or áéíóú, and any Greek or CJK text are completely obliterated, since highbit bytes are used regardless of the encoding. For example
H??gar, again, depending on the encoding used, either one byte in windows-1252 or two in UTF-8.
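The observed result of state 1 can be reproduced with a small simulation. This is a sketch of the visible effect only, not of whatever Yahoo's servers do internally; the function name `simulate_state1` is hypothetical.

```python
def simulate_state1(raw):
    """Replace every highbit byte (0x80 or above) with '?' (0x3F)."""
    return bytes(0x3F if b >= 0x80 else b for b in raw)


if __name__ == "__main__":
    # The first of the two spaces is a non-breaking space (U+00A0).
    message = "Hi Bob,\u00a0 raining today."
    for charset in ("cp1252", "utf-8"):
        damaged = simulate_state1(message.encode(charset))
        print(charset, "->", damaged.decode("ascii"))
    # cp1252 -> Hi Bob,? raining today.
    # utf-8 -> Hi Bob,?? raining today.
```

The number of question marks per character equals the number of highbit bytes the chosen encoding uses for it.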
State 2: All highbit bytes are interpreted as UTF-8, other encodings destroyed
This is the more puzzling corruption. When Yahoo servers behave like this, all messages encoded in UTF-8 pass through the servers unharmed. However, messages using other encodings, like windows-1252, are corrupted in a fiendish way. In general, mail servers should not interpret the message content but pass it through unchanged; Yahoo violates this and modifies email content. When in this state, Yahoo interprets all highbit bytes in a message as UTF-8. If a message contains a highbit byte which is not valid in the UTF-8 encoding, it is replaced with the so-called Unicode replacement character � (U+FFFD), encoded in UTF-8 as the three bytes EF BF BD. We need to remember that single highbit bytes, as used in the ISO-8859 or windows-12* encodings, are all invalid if interpreted as UTF-8, since in UTF-8 the first highbit byte must be followed by at least one more highbit byte. So the windows-1252 non-breaking space A0 and the windows-1252 encoded letter "ä", E4 as a byte, are each replaced by EF BF BD. But it gets worse. The receiving email client honours the declared encoding of the message and, for example, displays it as windows-1252. And in windows-1252, the byte sequence EF BF BD is displayed as the string ï¿½. Yes, the letter "i" with diaeresis, a Spanish opening question mark ¿ and the fraction ½. Example: A user sends
Hi Bob,  raining today.
Bob receives:
Hi Bob,ï¿½ raining today.
Needless to say, messages using 8bit ISO-8859 or windows-12* encodings, like Greek, will be totally obliterated.
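The chain of events in state 2 can be sketched as follows. It assumes that the server's handling behaves like Python's `errors="replace"` UTF-8 decoding followed by re-encoding; that is an assumption which matches the observed output, not a statement about Yahoo's actual implementation. The function name `simulate_state2` is hypothetical.

```python
def simulate_state2(raw):
    """Treat the bytes as UTF-8; invalid bytes become U+FFFD (EF BF BD)."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")


if __name__ == "__main__":
    message = "Hi Bob,\u00a0 raining today."

    # A UTF-8 message passes through unchanged.
    utf8_bytes = message.encode("utf-8")
    print(simulate_state2(utf8_bytes) == utf8_bytes)  # True

    # A windows-1252 message has its lone A0 byte replaced by EF BF BD ...
    damaged = simulate_state2(message.encode("cp1252"))
    print(damaged.hex(" ").upper())

    # ... which the receiving client, honouring windows-1252, shows as "ï¿½".
    print(damaged.decode("cp1252"))  # Hi Bob,ï¿½ raining today.
```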
What can be done at Yahoo
This horrible defect at Yahoo & friends should have caused a storm of complaints. The author of this article complained on Yahoo forums twice before those help forums were shut down. The Thunderbird community manager has repeatedly approached the person he believed to be the responsible manager at Verizon. There has been no change in Yahoo's behaviour other than switching from state 1 to state 2 or vice versa. The email sent to Yahoo in June 2019 can be found here. It only describes state 2, since that was the behaviour at time of writing.
What can be done in Thunderbird
Thunderbird can be instructed to send all messages in pure 7bit ASCII using the so-called quoted-printable (QP) encoding, where each highbit byte is transmitted as its hexadecimal value preceded by an "=" sign. So the letter "ä", or E4 in windows-1252, is transmitted as the string =E4. This is not very efficient, since each highbit byte takes up three bytes when transmitted and stored on mail servers, so text consisting mostly of 8bit characters, such as Greek, roughly triples in size. That's also why QP encoding is not the default in Thunderbird. To switch on QP encoding, set the preference mail.strictly_mime to true in the configuration editor: go to Tools > Options, Advanced, General tab, Config Editor, and paste mail.strictly_mime into the search field. Note that on Mac and Linux the options are under Edit > Preferences.
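What QP encoding does to highbit bytes can be tried out with Python's standard `quopri` module. This sketch only illustrates the encoding itself and is not Thunderbird's code; the helper name `to_qp` is made up for this example.

```python
import quopri


def to_qp(text, charset):
    """Encode text in the given charset, then as quoted-printable ASCII."""
    return quopri.encodestring(text.encode(charset)).decode("ascii")


if __name__ == "__main__":
    word = "Hägar"
    for charset in ("cp1252", "utf-8"):
        raw = word.encode(charset)
        qp = to_qp(word, charset)
        print(charset, "raw bytes:", len(raw), "QP:", qp, "QP bytes:", len(qp))
    # Decoding restores the original bytes exactly.
    print(quopri.decodestring(b"H=E4gar") == "Hägar".encode("cp1252"))  # True
```

Since the result contains only 7bit ASCII, servers in either defective state have no highbit bytes left to corrupt.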