214 lines
5.6 KiB
Plaintext
214 lines
5.6 KiB
Plaintext
|
{
|
|||
|
"cells": [
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"# 一个R语言实现的爬虫,爬取拉手网美食信息"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"点击下边的cell,点击上方工具栏里的执行图标,即可执行代码块,看到输出结果。代码块左边的In[]出现In[*]表示代码正在执行"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"library(XML)\n",
|
|||
|
"\n",
|
|||
|
"giveNames = function(rootNode){\n",
|
|||
|
" names <- xpathSApply(rootNode,\"//h3/a[@class='goods-name']\",xmlValue)\n",
|
|||
|
" names\n",
|
|||
|
"}\n",
|
|||
|
"\n",
|
|||
|
"givesevices = function(rootNode){\n",
|
|||
|
" sevices <- xpathSApply(rootNode,\"//h3/a[@class='goods-text']\",xmlValue)\n",
|
|||
|
" sevices\n",
|
|||
|
"}\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"giveprices = function(rootNode){\n",
|
|||
|
" prices <- xpathSApply(rootNode,\"//div/span[@class='price']\",xmlValue)\n",
|
|||
|
" prices\n",
|
|||
|
"}\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"givemoney = function(rootNode){\n",
|
|||
|
" money <- xpathSApply(rootNode,\"//div/span[@class='money']\",xmlValue)\n",
|
|||
|
" money\n",
|
|||
|
"}\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"giveplaces = function(rootNode){\n",
|
|||
|
" places <- xpathSApply(rootNode,\"//a/span[@class='goods-place']\",xmlValue)\n",
|
|||
|
" places\n",
|
|||
|
"}\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"getmeituan = function(URL){\n",
|
|||
|
" Sys.sleep(runif(1,1,2))\n",
|
|||
|
" doc<-htmlParse(URL[1],encoding=\"UTF-8\")\n",
|
|||
|
" rootNode<-xmlRoot(doc)\n",
|
|||
|
" data.frame(\n",
|
|||
|
" Names=giveNames(rootNode), #店名\n",
|
|||
|
" services=givesevices(rootNode), #服务\n",
|
|||
|
" prices=giveprices(rootNode), #现价\n",
|
|||
|
" money=givemoney(rootNode), #原价\n",
|
|||
|
" places=giveplaces(rootNode) #地点\n",
|
|||
|
" \n",
|
|||
|
" )\n",
|
|||
|
"}\n",
|
|||
|
"\n",
|
|||
|
"\n",
|
|||
|
"URL = paste0(\"http://shenzhen.lashou.com/cate/meishi/page\",1:10)\n",
|
|||
|
"\n",
|
|||
|
"mainfunction = function(URL){\n",
|
|||
|
" data = rbind(\n",
|
|||
|
" getmeituan (URL[1]),\n",
|
|||
|
" getmeituan (URL[2]),\n",
|
|||
|
" getmeituan (URL[3]),\n",
|
|||
|
" getmeituan (URL[4]),\n",
|
|||
|
" getmeituan (URL[5])\n",
|
|||
|
" )\n",
|
|||
|
" \n",
|
|||
|
" \n",
|
|||
|
"}\n",
|
|||
|
"ll=mainfunction(URL)\n",
|
|||
|
"write.table(ll,\"result.txt\",row.names=FALSE)\n",
|
|||
|
"ll\n"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"# R语言的线性回归实例"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"输入数据"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"#读入数据\n",
|
|||
|
"x<- c(0.10,0.11,0.12,0.13,0.14,0.15,0.16,0.17,0.18,0.20,0.21,0.23)\n",
|
|||
|
"y<-c(42.0,43.5,45.0,45.5,45.0,47.5,49,53,50,55,55,60)\n",
|
|||
|
"#绘出 x 与 y 的散列图\n",
|
|||
|
"plot(y~x)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"执行该段代码,即可看到输出图形;从图中我们可以看出 y 和 x 存在线性相关性,可以进行线性回归分析:"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"model<-lm(y~x)\n",
|
|||
|
"summary(model)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"我们通过 P 值(就是上面的 pr 那一列)来查看对应的解释变量 x 的显著性,通过将 p 值与 0.05 进行比较,若改值小于 0.05,就可以说该变量与被解释变量存在显著的相关性。\n",
|
|||
|
"\n",
|
|||
|
"Multiple R-squared 和 Adjusted R-squared 这两个值,就是我们常称为”拟合优度“和”修正的拟合优度“,是指回归方程对样本的拟合程度,这里我们可以看到,修正的拟合优度为 0.9429,表示拟合程度超过五成,这个值越高越好。\n",
|
|||
|
"\n",
|
|||
|
"最后,看下 F-statistic,也就是常说的 F 统计量,也称为 F 检验,常用语判断方程整体的显著性实验,其 p 值为 9.505e-08,显然小于 0.05,我们可以认为方程在 P=0.05 的水平上是通过显著性检验的。\n",
|
|||
|
"\n",
|
|||
|
"从上面我们看出我们的线性回归效果不错,那么我们可以利用拟合方程进行分类,或者预测。"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"newX<-data.frame(x=0.16)\n",
|
|||
|
"predict(model,newdata=newX,interval=\"prediction\",level=0.95)#interval=”prediction“ level指定预测的置信区间"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "markdown",
|
|||
|
"metadata": {},
|
|||
|
"source": [
|
|||
|
"# R语言的逻辑回归示例"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {
|
|||
|
"collapsed": false
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": [
|
|||
|
"counts <- c(18,17,15,20,10,20,25,13,12)\n",
|
|||
|
"outcome <- gl(3,1,9)\n",
|
|||
|
"treatment <- gl(3,3)\n",
|
|||
|
"print(d.AD <- data.frame(treatment, outcome, counts))\n",
|
|||
|
"glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())\n",
|
|||
|
"anova(glm.D93)\n",
|
|||
|
"summary(glm.D93)"
|
|||
|
]
|
|||
|
},
|
|||
|
{
|
|||
|
"cell_type": "code",
|
|||
|
"execution_count": null,
|
|||
|
"metadata": {
|
|||
|
"collapsed": true
|
|||
|
},
|
|||
|
"outputs": [],
|
|||
|
"source": []
|
|||
|
}
|
|||
|
],
|
|||
|
"metadata": {
|
|||
|
"kernelspec": {
|
|||
|
"display_name": "R",
|
|||
|
"language": "R",
|
|||
|
"name": "ir"
|
|||
|
},
|
|||
|
"language_info": {
|
|||
|
"codemirror_mode": "r",
|
|||
|
"file_extension": ".r",
|
|||
|
"mimetype": "text/x-r-source",
|
|||
|
"name": "R",
|
|||
|
"pygments_lexer": "r",
|
|||
|
"version": "3.2.3"
|
|||
|
}
|
|||
|
},
|
|||
|
"nbformat": 4,
|
|||
|
"nbformat_minor": 0
|
|||
|
}
|