UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 5: invalid continuation byte
If you are running into the error "UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte", this article explains why it happens and how to get past it.
Reason for the "UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte" error
This problem is common when reading a CSV file with pandas. The read_csv() function decodes the file as UTF-8 by default, so it fails when the file was saved in a different encoding and contains bytes that are not valid UTF-8, typically from special characters.
To see how the error happens, we will read a CSV file from the biomedical domain with pandas.
You can download the CSV file here.
Code:
import pandas as pd

data = pd.read_csv("alldata_1_for_kaggle.csv")
data.head()
Result:
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-76-0c9089169b2f> in <module>
      1 import pandas as pd
----> 2 a = pd.read_csv('/content/drive/MyDrive/LearnShareIT/alldata_1_for_kaggle.csv')

/usr/local/lib/python3.7/dist-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 3: invalid start byte
Note: The byte value and the position depend on the file, so the message generally has this format: UnicodeDecodeError: 'utf-8' codec can't decode byte <<byte value>> in position <<position>>: invalid start byte.
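The message can be reproduced without any file, which makes its two fields easier to read: the hex value is the offending byte, and the position is its offset in the input. The following is a minimal sketch using only the built-in bytes.decode(); the sample byte strings are invented for illustration and are not taken from the CSV file.

```python
# Sample inputs are invented for illustration; they are not from the CSV file.
samples = [
    b"\xa5 yen",         # 0xa5 can never start a UTF-8 sequence
    b"hello\xd4 world",  # 0xd4 starts a 2-byte sequence, but a space follows
]

reports = []
for raw in samples:
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the bad byte, err.reason the short explanation
        bad_byte = raw[err.start]
        reports.append((hex(bad_byte), err.start, err.reason))
        print(err)

print(reports)
```

The first sample produces the "invalid start byte" variant and the second the "invalid continuation byte" variant; both mean the data is not valid UTF-8.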
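Solutions to this problem
Solution for reading a CSV file:
Python supports many standard encodings. A few common ones, such as latin1 (alias iso-8859-1) and ascii (alias us-ascii), can also bypass the codecs lookup machinery to improve performance.
You can pass the "encoding" parameter to read_csv() with a string naming the encoding used to decode the file.
In our example, we use "latin1" to decode the data. Because latin1 maps every possible byte to a character, it never raises this error, but characters will come out garbled if the file was actually saved in another encoding.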
Code:
import pandas as pd

data = pd.read_csv("alldata_1_for_kaggle.csv", encoding='latin1')  # pass encoding parameter
data.head()
Result:
   Unnamed: 0               0                                                  a
0           0  Thyroid_Cancer  Thyroid surgery in children in a single insti...
1           1  Thyroid_Cancer  " The adopted strategy was the same as that us...
2           2  Thyroid_Cancer  coronary arterybypass grafting thrombosis ï¬b...
3           3  Thyroid_Cancer  Solitary plasmacytoma SP of the skull is an u...
4           4  Thyroid_Cancer  This study aimed to investigate serum matrix ...
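The "ï¬" in row 2 is probably UTF-8 text that was decoded as latin1: latin1 accepts any byte, so a successful read does not prove the encoding was right. The sketch below shows that effect on a made-up string and adds a hypothetical helper, first_working_encoding, that tries stricter encodings before falling back to latin1. The candidate list is an assumption and should be adapted to where the file came from.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252", "latin1")):
    """Return the first encoding in `candidates` that decodes `raw` without error.

    Hypothetical helper: the candidate list is an assumption, not a rule.
    """
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError("no candidate encoding could decode the data")


# latin1 maps every byte to a character, so it never fails, even when it is wrong.
utf8_bytes = "ﬁbrosis".encode("utf-8")  # invented sample starting with the "ﬁ" ligature
garbled = utf8_bytes.decode("latin1")
print(repr(garbled))

# Trying stricter encodings first avoids that silent garbling when possible.
print(first_working_encoding(utf8_bytes))
print(first_working_encoding("café".encode("cp1252")))
```

With pandas, one way to use this would be to read the file's bytes with open(path, "rb").read() and pass the returned name as the encoding argument of read_csv(). Note that a match only means the bytes are decodable in that encoding, not that it is the encoding the author used.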
Solution for reading text and JSON files:
The initial content of the JSON file and the text file:
{"student":[ { "firstName":"™œœ''™™œ""××""™"ˆ'γ°°'ˆ'"œ™"ε""Ãö", "lastName":"Doe" }, { "firstName":"Anna", "lastName":"Smith" }, { "firstName":"Peter", "lastName":"Jones" } ] }
œMedical Informatics and œHealth Care Sciences
Open the file and read it in binary mode
Syntax: file_reader = open("path/to/file", "rb"), where "rb" is binary read mode.
Read json file:
import json

# Open in binary mode; json.load() detects the encoding from the bytes
with open('a.json', 'rb') as file:
    content = json.load(file)
print(content)
Result:
{'student': [{'firstName': "™œ\x9dœ\x9d''™™œ\x9d""××""™"ˆ'γ°°'ˆ'"œ\x9d™"ε""Ã\xadö", 'lastName': 'Doe'}, {'firstName': 'Anna', 'lastName': 'Smith'}, {'firstName': 'Peter', 'lastName': 'Jones'}]}
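Binary mode works here because json.load() (Python 3.6 and later) accepts bytes and detects UTF-8, UTF-16 or UTF-32 by itself. An alternative is to stay in text mode but state the encoding explicitly instead of relying on the platform default. The sketch below compares both on a temporary file; the file name and its contents are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical sample data, written to a temporary file so the sketch is self-contained.
sample = {"student": [{"firstName": "Zoë", "lastName": "Doe"}]}
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)

# Option 1: binary mode, json detects the encoding from the bytes.
with open(path, "rb") as f:
    from_binary = json.load(f)

# Option 2: text mode with the encoding stated explicitly.
with open(path, "r", encoding="utf-8") as f:
    from_text = json.load(f)

print(from_binary == from_text == sample)
os.remove(path)
```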
Read text file:
# In binary mode, read() returns raw bytes without decoding them
with open('a.txt', 'rb') as file:
    print(file.read())
Result:
b'\xc5\x93Medical Informatics\xc2\x9d and \xc5\x93Health Care Sciences'
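Binary mode avoids the error only because nothing is decoded: the result is a bytes object, not text. To get a string, the bytes still have to be decoded with the right encoding. A short sketch using the bytes printed above:

```python
# The bytes printed above, copied from the result of file.read() in "rb" mode.
raw = b'\xc5\x93Medical Informatics\xc2\x9d and \xc5\x93Health Care Sciences'

text = raw.decode("utf-8")  # succeeds, so these bytes are valid UTF-8
print(repr(text))
print(len(raw), len(text))  # byte count vs. character count
```

The two lengths differ because each of the three non-ASCII characters takes two bytes in UTF-8.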
Ignoring errors when reading the file
Syntax: file = open("path/to/file", "r", errors="ignore"). Note that ignoring encoding errors silently drops the bytes that cannot be decoded, which can lead to data loss.
Read json file:
import json

# errors='ignore' skips bytes that cannot be decoded instead of raising
with open('a.json', 'r', errors='ignore') as file:
    content = json.load(file)
print(content)
Result:
{'student': [{'firstName': "â„¢Å"ÂÅ"Â''™™Å"Ââ€â€œÃƒâ€"Ãâ€"â€â€â„¢â€œË†â€™ÃŽÂ³Â°Â°â€™Ë†â€™â€œÅ"™“ε““ÃÂ\xadö", 'lastName': 'Doe'}, {'firstName': 'Anna', 'lastName': 'Smith'}, {'firstName': 'Peter', 'lastName': 'Jones'}]}
Read txt file:
# errors='ignore' skips bytes that cannot be decoded instead of raising
with open('a.txt', 'r', errors='ignore') as file:
    print(file.read())
Result:
Å"Medical Informatics and Å"Health Care Sciences
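To make the data-loss risk concrete, here is a small sketch comparing errors="ignore" with errors="replace" and with decoding in the correct encoding. The sample bytes are invented; the same errors values can be passed to open().

```python
raw = b"caf\xe9 au lait"  # invented sample: "café au lait" saved as latin1, not UTF-8

ignored = raw.decode("utf-8", errors="ignore")    # drops the undecodable byte
replaced = raw.decode("utf-8", errors="replace")  # inserts U+FFFD in its place
correct = raw.decode("latin1")                    # right encoding: nothing is lost

print(repr(ignored))
print(repr(replaced))
print(repr(correct))
```

With "ignore" the character disappears without a trace, while "replace" at least leaves a visible marker where something went wrong. Neither recovers the original character, so finding the right encoding is preferable when possible.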
Summary
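"UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte" is a common error when reading files. It means the file contains bytes that are not valid UTF-8. You can fix it by passing the correct encoding, by reading the file in binary mode, or, at the cost of possible data loss, by ignoring decoding errors.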
Full Name: Huan Nguyen
Source: https://learnshareit.com/unicodedecodeerror-utf8-codec-cant-decode-byte-0xa5-in-position-0-invalid-start-byte/