php数据采集 PHP数据采集主要包括三个过程

php采集大数据的方案

1、建议你读写数据和下载图片分开，各用不同的进程完成。

创新互联服务项目包括梓潼网站建设、梓潼网站制作、梓潼网页制作以及梓潼网络营销策划等。多年来，我们专注于互联网行业，利用自身积累的技术优势、行业经验、深度合作伙伴关系等，向广大中小型企业、政府机构等提供互联网行业的解决方案，梓潼网站推广取得了明显的社会效益与经济效益。目前，我们服务的客户以成都为中心已经辐射到梓潼省份的部分城市，未来相信会继续扩大服务区域并继续获得客户的支持与信任！

比如说，取数据用get-data.php，下载图片用get-image.php。

2、多进程的话，php可以简单的用pcntl_fork()。这样可以并发多个子进程。

但是我不建议你用fork，我建议你安装一个gearman worker。这样你要并发几个，就启几个worker，写代码简单，根本不用在代码里考虑thread啊，process等等。

3、综上，解决方案这样：

（1）安装gearman worker。

（2）写一个get-data.php，在crontab里设置它每5分钟执行一次，只负责读数据，然后把读回来的数据一条一条的扔到 gearman worker的队列里；

然后再写一个处理数据的脚本作为worker，例如叫process-data.php，这个脚本常驻内存。它作为worker从geraman 队列里读出一条一条的数据，然后跟你的数据库老数据比较，进行你的业务逻辑。如果你要10个并发，那就启动10个process-data.php好了。处理完后，如果图片地址有变动需要下载图片，就把图片地址扔到 gearman worker的另一个队列里。

（3）再写一个download-data.php，作为下载图片的worker，同样，你启动10个20个并发随便你。这个进程也常驻内存运行，从gearman worker的图片数据队列里取数据出来，下载图片

4、常驻进程的话，就是在代码里写个while(true)死循环，让它一直运行好了。如果怕内存泄露啥的，你可以每循环10万次退出一下。然后在crontab里设置，每分钟检查一下进程有没有启动，比如说这样启动3个process-data worker进程：

* * * * * flock -xn /tmp/process-data.1.lock -c '/usr/bin/php /process-data.php /dev/null 21'

* * * * * flock -xn /tmp/process-data.2.lock -c '/usr/bin/php /process-data.php /dev/null 21'

* * * * * flock -xn /tmp/process-data.3.lock -c '/usr/bin/php /process-data.php /dev/null 21'

不知道你明白了没有

php 百度知道数据采集

问题其实不难，自己都能写。给你几个思路吧：

1.在百度知道中，输入linux，然后会出现列表。复制浏览器地址栏内容。

然后翻页，在复制地址栏内容，看看有什么不同，不同之处，就是你要循环分页的i值。

当然这个是笨方法。

2.使用php的file或者file_get_contents函数，获取链接URL的内容。

3.通过php正则表达式，获取你需要的3个字段内容。

4.写入数据库。

需要注意的是，百度知道有可能做了防抓取的功能，你刚一抓几个页面，可能会被禁止。

建议也就抓10页数据。

其实不难，你肯定写的出来。还有，网上应该有很多抓取工具，你找找看，然后将抓下来的数据

在做分析。写入数据库。

PHP 采集程序中常用的函数

复制代码

代码如下:

//获得当前的脚本网址

function

get_php_url()

{

if(!empty($_SERVER[”REQUEST_URI”]))

{

$scriptName

$_SERVER[”REQUEST_URI”];

$nowurl

$scriptName;

}

else

{

$scriptName

$_SERVER[”PHP_SELF”];

if(empty($_SERVER[”QUERY_STRING”]))

$nowurl

$scriptName;

else

$nowurl

$scriptName.”?”.$_SERVER[”QUERY_STRING”];

}

return

$nowurl;

}

//把全角数字转为半角数字

function

GetAlabNum($fnum)

{

$nums

array(”0”,”1”,”2”,”3”,”4”,”5”,”6”,”7”,”8”,”9”);

$fnums

“0123456789″;

for($i=0;$i=9;$i++)

$fnum

str_replace($nums[$i],$fnums[$i],$fnum);

$fnum

ereg_replace(”[^0-9\.]|^0{1,}”,””,$fnum);

if($fnum==””)

$fnum=0;

return

$fnum;

}

//去除HTML标记

function

Text2Html($txt)

{

$txt

str_replace(”

“,”　”,$txt);

$txt

str_replace(””,””,$txt);

$txt

str_replace(””,””,$txt);

$txt

preg_replace(”/[\r\n]{1,}/isU”,”br/\r\n”,$txt);

return

$txt;

}

//清除HTML标记

function

ClearHtml($str)

{

$str

str_replace('','',$str);

$str

str_replace('','',$str);

return

$str;

}

//相对路径转化成绝对路径

function

relative_to_absolute($content,

$feed_url)

{

preg_match('/(http|https|ftp):\/\//',

$feed_url,

$protocol);

$server_url

preg_replace(”/(http|https|ftp|news):\/\//”,

“”,

$feed_url);

$server_url

preg_replace(”/\/.*/”,

“”,

$server_url);

($server_url

”)

{

return

$content;

}

(isset($protocol[0]))

{

$new_content

preg_replace('/href=”\//',

‘href=”‘.$protocol[0].$server_url.'/',

$content);

$new_content

preg_replace('/src=”\//',

'src=”‘.$protocol[0].$server_url.'/',

$new_content);

}

else

{

$new_content

$content;

}

return

$new_content;

}

//取得所有链接

function

get_all_url($code){

preg_match_all('/a\s+href=[”|\']?([^”\'

]+)[”|\']?\s*[^]*([^]+)\/a/i',$code,$arr);

return

array('name'=$arr[2],'url'=$arr[1]);

}

//获取指定标记中的内容

function

get_tag_data($str,

$start,

$end)

{

(

$start

”

$end

”

)

{

return;

}

$str

explode($start,

$str);

$str

explode($end,

$str[1]);

return

$str[0];

}

//HTML表格的每行转为CSV格式数组

function

get_tr_array($table)

{

$table

preg_replace(”‘td[^]*?'si”,'”‘,$table);

$table

str_replace(”/td”,'”,',$table);

$table

str_replace(”/tr”,”{tr}”,$table);

//去掉

HTML

标记

$table

preg_replace(”‘[\/\!]*?[^]*?'si”,””,$table);

//去掉空白字符

$table

preg_replace(”‘([\r\n])[\s]+'”,””,$table);

$table

str_replace(”

“,””,$table);

$table

str_replace(”

“,””,$table);

$table

explode(”,{tr}”,$table);

array_pop($table);

return

$table;

}

//将HTML表格的每行每列转为数组，采集表格数据

function

get_td_array($table)

{

$table

preg_replace(”‘table[^]*?'si”,””,$table);

$table

preg_replace(”‘tr[^]*?'si”,””,$table);

$table

preg_replace(”‘td[^]*?'si”,””,$table);

$table

str_replace(”/tr”,”{tr}”,$table);

$table

str_replace(”/td”,”{td}”,$table);

//去掉

HTML

标记

$table

preg_replace(”‘[\/\!]*?[^]*?'si”,””,$table);

//去掉空白字符

$table

preg_replace(”‘([\r\n])[\s]+'”,””,$table);

$table

str_replace(”

“,””,$table);

$table

str_replace(”

“,””,$table);

$table

explode('{tr}',

$table);

array_pop($table);

foreach

($table

$key=$tr)

{

$td

explode('{td}',

$tr);

array_pop($td);

$td_array[]

$td;

}

return

$td_array;

}

//返回字符串中的所有单词

$distinct=true

去除重复

function

split_en_str($str,$distinct=true)

{

preg_match_all('/([a-zA-Z]+)/',$str,$match);

($distinct

true)

{

$match[1]

array_unique($match[1]);

}

sort($match[1]);

return

$match[1];

}

怎么用php采集网站数据

简单的分了几个步骤：

1、确定采集目标

2、获取目标远程页面内容（curl、file_get_contents）

3、分析页面html源码，正则匹配你需要的内容（preg_match、preg_match_all），这一步最为重要，不同页面正则匹配规则不一样

4、入库

php通过post传输的json数据能采集吗

不能。所谓的json数据格式是http请求中的body是一个json格式的字符串，这个用$_POST就获取不到了。PHP是一种易于学习和使用的服务器端脚本语言。只需要很少的编程知识你就能使用PHP建立一个真正交互的WEB站点。

php curl 大量数据采集

这个需要配合js，打开一个html页面，首先js用ajax请求页面，返回第一个页面信息确定处理完毕（ajax有强制同步功能），ajax再访问第二个页面。（或者根据服务器状况，你可以同时提交几个URL，跑几个相同的页面）

参数可以由js产生并传递url，php后台页面根据URL抓页面。然后ajax通过php，在数据库或者是哪里设一个标量，标明检测到哪里。由于前台的html页面执行多少时候都没问题，这样php的内存限制和执行时间限制就解决了。

因为不会浪费大量的资源用一个页面来跑一个瞬间500次的for循环了。（你的500次for循环死了原因可能是获取的数据太多，大过了php限制的内存）

不过印象中curl好像也有强制同步的选项，就是等待一个抓取后再执行下一步。但是这个500次都是用一个页面线程处理，也就是说肯定会远远大于30秒的默认执行时间。

本文名称：php数据采集 PHP数据采集主要包括三个过程
文章起源：http://ybzwz.com/article/doceoih.html

php数据采集 PHP数据采集主要包括三个过程

php采集大数据的方案

php 百度 知道数据采集

PHP 采集程序中常用的函数

怎么用php采集网站数据

php通过post传输的json数据能采集吗

php curl 大量数据采集

其他资讯

php 百度知道数据采集