How do I decode this string, which is represented as unicode?

Stack Overflow user

Asked 2018-06-26 06:26:09

Answers 2 · Views 389 · Followers 0 · Votes 2

When I try to parse a web page with readability, I get the following (Python 2.7 on Windows 10, Sublime Text 2 / cmd):

>>> import requests
>>> from readability import Document
>>>
>>> response = requests.get('http://www.gamersky.com/news/201806/1064930.shtml')
>>> doc = Document(response.text.encode("utf-8"))
>>> print doc.title()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'gbk' codec can't encode character u'\xe3' in position 0: illegal multibyte sequence
>>> print doc.title().encode("utf-8")
lots of messy codes
>>> print doc.title().encode("utf-16")
lots of messy codes
>>> print doc.title().encode("gbk")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'gbk' codec can't encode character u'\xe3' in position 0: illegal multibyte sequence

I found that I could never print doc.title(), so I inspected its repr by running:

s = repr(doc.title())
print type(doc.title())
print s

The result is strange:

<type 'unicode'>
u'\xe3\x80\x8a\xe5\xa5\x87\xe5\xbc\x82\xe4\xba\xba\xe7\x94\x9f\xe3\x80\x8b\xe5\x9b\xa2\xe9\x98\x9f\xe6\x96\xb0\xe4\xbd\x9c\xe3\x80\x8a\xe8\xb6 \xe8\x83\xbd\xe9\x98\x9f\xe9\x95\xbf\xe3\x80\x8b\xe5 \x8d\xe8\xb4\xb9\xe4\xb8\x8b\xe8\xbd\xbd \xe5\xb0\x8f\xe7\x94\xb7\xe5\xad\xa9\xe7\x9a\x84\xe8\x8b\xb1\xe9\x9b\x84\xe6\xa2\xa6\xe6\x83\xb3 _ \xe6\xb8\xb8\xe6\xb0\x91\xe6\x98\x9f\xe7\xa9\xba GamerSky.com'

It looks like s is actually a multibyte-encoded string, because when I run

 print '\xe3\x80...'

it prints

《奇异人生》团队新作《? 能队长》? ?费下载 小男孩的英雄梦想 _ 游民星空 GamerSky.com

The correct title is

《奇异人生》团队新作《超能队长》免费下载 小男孩的英雄梦想 _ 游民星空 GamerSky.com

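Although some characters are still missing, this result convinces me that \xe3 should not be represented in unicode form. After some searching, I found that the code below helps, but some characters are still missing.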

>>> print s.encode("raw_unicode_escape")
《奇异人生》团队新作《? 能队长》? ?费下载 小男孩的英雄梦想 _ 游民星空 GamerSky.com

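My questions are:

Why does this problem occur? Is the encode("raw_unicode_escape") solution clean? When I run the following code, it works:

>>> import requests
>>> from readability import Document
>>>
>>> response = requests.get('https://zh.wikipedia.org/wiki/Wikipedia:%E9%A6%96%E9%A1%B5')
>>> doc = Document(response.text.encode("utf-8"))
>>> print doc.title()
维基百科，自由的百科全书

How do I deal with the missing characters?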

python-2.7

unicode

python-unicode

python

2 Answers

Stack Overflow user

Accepted answer

Posted 2018-06-26 06:36:33

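The problem is that when you use response.text, requests guesses which encoding to use when decoding response.content to unicode. In this case its guess is wrong. You have to force the encoding by setting response.encoding to 'utf-8', as described in the documentation.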

import requests
from readability import Document
response = requests.get('http://www.gamersky.com/news/201806/1064930.shtml')
response.encoding = 'utf-8'
doc = Document(response.text)
print doc.title()

This prints:

《奇异人生》团队新作《超能队长》免费下载 小男孩的英雄梦想 _ 游民星空 GamerSky.com
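
To make the wrong guess concrete, here is a minimal offline sketch. It assumes the guessed encoding was ISO-8859-1, which the u'\xe3\x80\x8a...' repr in the question suggests but the thread does not confirm. TITLE, mis_decode and repair are hypothetical names used only for illustration, and the code is written for Python 3 (the u"" literals are kept so it reads the same next to the Python 2.7 code above).

```python
# -*- coding: utf-8 -*-
# Offline sketch: no network access; the title text is copied from the question.

TITLE = u"《奇异人生》团队新作《超能队长》免费下载"


def mis_decode(raw_bytes, guessed="iso-8859-1"):
    """Decode bytes with a guessed charset (assumption: the guess was ISO-8859-1)."""
    return raw_bytes.decode(guessed)


def repair(mojibake, guessed="iso-8859-1", actual="utf-8"):
    """Undo a wrong guess: recover the original bytes, then decode them correctly."""
    return mojibake.encode(guessed).decode(actual)


raw = TITLE.encode("utf-8")   # stands in for response.content
wrong = mis_decode(raw)       # stands in for response.text after a bad guess

# The mis-decoded text starts like the repr shown in the question.
print(wrong[:3] == u"\xe3\x80\x8a")
# Re-encoding with the guessed charset restores the bytes, so UTF-8 decoding works.
print(repair(wrong) == TITLE)
```

The round trip works because ISO-8859-1 maps every byte value 0x00 to 0xFF onto the code point with the same number, so no information is lost as long as the mis-decoded string is left untouched. Setting response.encoding before reading response.text avoids the detour altogether.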

Votes 1

Stack Overflow user

Posted 2018-06-26 06:32:52

Try using response.content

Ex:

>>> import requests
>>> from readability import Document
>>>
>>> response = requests.get('http://www.gamersky.com/news/201806/1064930.shtml')
>>> doc = Document(response.content)
>>> print doc.title()
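
One reason to hand over the raw bytes rather than repair the title afterwards is that the mis-decoded text is fragile. The sketch below shows that the characters reported missing in the question (超 and 免) both contain the byte 0x85, which becomes u'\x85' after an ISO-8859-1 decode, and that Python treats u'\x85' as whitespace. Whether readability actually normalizes whitespace this way is an assumption on my part, not something the thread confirms; the names here are hypothetical and the code targets Python 3.

```python
# -*- coding: utf-8 -*-
# Offline sketch of how a whitespace clean-up can destroy mis-decoded text.

TITLE = u"《超能队长》免费下载"

# Assumption: the page's UTF-8 bytes were decoded as ISO-8859-1.
mojibake = TITLE.encode("utf-8").decode("iso-8859-1")

# U+0085 (NEXT LINE) counts as whitespace for str.split() and str.isspace().
has_nel = u"\x85" in mojibake
nel_is_space = u"\x85".isspace()

# A typical normalization step (hypothetical here) replaces it with a plain space.
normalized = u" ".join(mojibake.split())

# The original bytes are now gone, so the round trip can no longer restore the title.
recovered = normalized.encode("iso-8859-1").decode("utf-8", "replace")

print(has_nel, nel_is_space)
print(recovered == TITLE)
print(u"\ufffd" in recovered)
```

Once the 0x85 bytes have been replaced by spaces, no later encode or decode call can bring the two characters back, which is consistent with the gaps in the raw_unicode_escape output in the question. Passing response.content, or setting response.encoding first, keeps the bytes intact until they are decoded with the right charset.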

Votes 2

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/51036156
