docklet/tools/python_demo.ipynb

494 lines
12 KiB
Plaintext
Raw Normal View History

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 用Python分析《美女与野兽》"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在一篇最近发表的论文*A quantitative analysis of gendered compliments in Disney Princess films*中Carmen Fought和Karen Eisenhauer发现在这部迪士尼经典影片中女性角色的对话要多于迪士尼近期的电影作品。作者在网络上发现了美女与《野兽》的脚本因此我立刻用Python重做了他们的分析。\n",
"<br />更多地我在文章最后加入了对《玩具总动员》的分析这个脚本的形式完全不同但其中91%的对白来自男性角色。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"点击下边的cell点击上方工具栏里的执行图标即可执行代码块看到输出结果。代码块左边的In[]出现In[*]表示代码正在执行"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from __future__ import division\n",
"\n",
"import re\n",
"from collections import defaultdict\n",
"\n",
"import requests\n",
"import pandas as pd\n",
"import matplotlib\n",
"\n",
"%matplotlib inline\n",
"matplotlib.style.use('ggplot')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Load the script which comes as a text file\n",
"\n",
"script_url = 'http://www.fpx.de/fp/Disney/Scripts/BeautyAndTheBeast.txt'\n",
"script = requests.get(script_url).text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"我们看下脚本的开篇:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Let's look at the beginning of the script\n",
"\n",
"script.splitlines()[:20]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"再在中间随意选取一段:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Let's look at a random place\n",
"\n",
"script.splitlines()[500:520]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"看上去很容易分析,因为角色和对白间用:隔开"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# seems fairly easy to parse since \n",
"# each new speaking line has : and begins with all caps\n",
"\n",
"def remove_spaces(line):\n",
" # remove the weird spaces\n",
" return re.sub(' +',' ',line)\n",
"\n",
"def remove_paren(line):\n",
" # remove directions that are not spoken\n",
" return re.sub(r'\\([^)]*\\)', '', line)\n",
"\n",
"\n",
"lines = []\n",
"line = ''\n",
"for row in script.splitlines():\n",
" if ': ' in row and row[:3].upper() == row[:3]:\n",
" line = remove_spaces(line)\n",
" line = remove_paren(line)\n",
" lines.append(line)\n",
" line = row\n",
" elif ' ' in row:\n",
" line = line + ' ' + row.lstrip()\n",
"# don't forget the last line\n",
"lines.append(remove_spaces(line))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"lines[:15]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"看看结尾什么样:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# How does the end look\n",
"\n",
"lines[-5:]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# 我们去掉可能的空白行\n",
"\n",
"print (len(lines))\n",
"lines = [l for l in lines if len(l) > 0]\n",
"print (len(lines))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"现在,我们找出所有角色,并计算他们的出场次数(对白数)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# now figure out the roles and how many times they appear\n",
"\n",
"roles = defaultdict(int)\n",
"\n",
"for line in lines:\n",
" # take advantage of the fact that the speaker is always listed before the :\n",
" speaker = line.split(':')[0]\n",
" roles[speaker] = roles[speaker] + 1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"len(roles)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"看一下每个角色出现的相对频率:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# take a look at the relative frequency of each role\n",
"roles"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"看起来有一行“to think about”是乱入的恰好满足了parse条件我们忽略它"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Looks like there is one bum line ('to think about'')\n",
"# But I'll ignore that for now.\n",
"\n",
"# Quickly eye ball which roles are female and which are possibly mixed groups.\n",
"\n",
"females = ['WOMAN 1',\n",
" 'WOMAN 2',\n",
" 'WOMAN 3',\n",
" 'WOMAN 4',\n",
" 'WOMAN 5',\n",
" 'OLD CRONIES',\n",
" 'MRS. POTTS',\n",
" 'BELLE',\n",
" 'BIMBETTE 1'\n",
" 'BIMBETTE 2',\n",
" 'BIMBETTE 3']\n",
"\n",
"groups = ['MOB',\n",
" 'ALL',\n",
" 'BOTH']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"将每一行对白根据角色性别进行标记,并统计不同性别的对白数量"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Mark each line of dialogue by sex and count them\n",
"\n",
"sex_lines = {'Male': 0,\n",
" 'Female': 0}\n",
"\n",
"for line in lines:\n",
" # Extract speaker \n",
" speaker = line.split(':')[0]\n",
" \n",
" if speaker in females:\n",
" sex_lines['Female'] += 1\n",
" \n",
" elif sex_lines not in groups:\n",
" sex_lines['Male'] += 1\n",
"\n",
"print (sex_lines)\n",
"print (sex_lines['Male']/(sex_lines['Male'] + sex_lines['Female']))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"我们使用一张图来显示结果:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Quick graphical representation \n",
"\n",
"df = pd.DataFrame([sex_lines.values()],columns=sex_lines.keys())\n",
"df.plot(kind='bar')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"也许男性角色和女性角色的对白长度有明显不同?我们来看一看<br/>这次我们计算对白中单词数量而不是计算对白次数:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Maybe men and women talk for different lengths? This counts words instead of \n",
"\n",
"sex_words = {'Male': 0,\n",
" 'Female': 0}\n",
"\n",
"for line in lines:\n",
" speaker = line.split(':')[0]\n",
" dialogue = line.split(':')[1] \n",
" # remove the \n",
" # tokenize sentence by spaces\n",
" word_count = len(dialogue.split(' ')) \n",
" \n",
" if speaker in females:\n",
" sex_words['Female'] += word_count\n",
" elif speaker not in groups:\n",
" sex_words['Male'] += word_count\n",
"\n",
"print (sex_words)\n",
"print (sex_words['Male']/(sex_words['Male'] + sex_words['Female']))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"也用图表显示出来:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Quick graphical representation \n",
"\n",
"df = pd.DataFrame([sex_words.values()],columns=sex_words.keys())\n",
"df.plot(kind='bar')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"下面是额外的《玩具总动员》的分析"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Bonus toy story analysis\n",
"\n",
"url = 'http://www.dailyscript.com/scripts/toy_story.html'\n",
"toy_story_script = requests.get(url).text\n",
"\n",
"# toy_story_script.splitlines()[250:350]\n",
"\n",
"lines = []\n",
"speaker = ''\n",
"dialogue = ''\n",
"for row in toy_story_script.splitlines()[90:]:\n",
" if ' ' in row: \n",
" if ':' not in speaker:\n",
" lines.append( {'Speaker': remove_paren(speaker).strip(),\n",
" 'Dialogue': remove_paren(dialogue).strip() } )\n",
" \n",
" speaker = remove_spaces(row.strip())\n",
" dialogue = ''\n",
" elif ' ' in row:\n",
" dialogue = dialogue + ' ' + remove_spaces(row)\n",
"lines.append( {'Speaker': remove_paren(speaker).strip(),\n",
" 'Dialogue': remove_paren(dialogue).strip() } )\n",
"\n",
"roles = defaultdict(int)\n",
"\n",
"for line in lines:\n",
" speaker = line['Speaker']\n",
" roles[speaker] = roles[speaker] + 1\n",
"\n",
"toy_story_df = pd.DataFrame(lines[1:])\n",
"toy_story_df.head()\n",
"\n",
"toy_story_df.Speaker.value_counts()\n",
"\n",
"def what_sex(speaker):\n",
" if speaker in [\"SID'S MOM\", 'MRS. DAVIS', 'HANNAH', 'BO PEEP']:\n",
" return 'Female'\n",
" return 'Male'\n",
"\n",
"toy_story_df['Sex'] = toy_story_df['Speaker'].apply(what_sex)\n",
"\n",
"sex_df = toy_story_df.groupby('Sex').size()\n",
"sex_df.plot(kind='bar')\n",
"sex_df\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def word_count(dialogue):\n",
" return len(dialogue.split())\n",
"\n",
"toy_story_df['Word Count'] = toy_story_df['Dialogue'].apply(word_count)\n",
"\n",
"word_df = toy_story_df.groupby('Sex')['Word Count'].sum()\n",
"word_df.plot(kind='bar')\n",
"word_df"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1+"
}
},
"nbformat": 4,
"nbformat_minor": 0
}