Fetching Web Pages from the WebBase Web Page Repository

The InfoLab was formerly the Stanford Database Group
Gary Wesley (Bluegee gmail)
Updated October 2013


This page describes how to retrieve Web pages from the Stanford WebBase Archive,
a World Wide Web page repository built by members of the Stanford InfoLab
as part of the Stanford Digital Libraries Project.


WebBase has been used from over 37 countries


The Repository

This web repository holds over 260 TB (uncompressed, as of August 2011) of roughly 7 billion web pages, intended for research into topics such as web graph analysis and election or disaster press coverage (we have a workbench for press-coverage analysis and coding).
The general text crawls are each about 0.5 TB compressed (1.5 TB uncompressed); sizes below are in compressed units. Because the general crawls use almost the same site list each time, we now effectively have rudimentary time-series data. Lists of sites with page counts are available via the "sites" links below. Our web crawler, or spider, is named WebVac; see the technical report Stanford WebBase Components and Applications. Instructions for building the client software and an architecture diagram appear later on this page. We are working in cooperation with the Library of Congress and the California Digital Library.

We now have tools for computational sociology in our Web Sociologist's Workbench, which the Stanford Communication Department used for election-coverage analysis. A picture of a sample screen is available (the letter in each checkbox label is a keyboard shortcut), as is a 2007 report on our efforts. A version of the workbench is being used for a memetic-epidemiology project with the Stanford Medical School
involving Myspace blogs.



We have a collection of the links from each of the general crawls. These are available upon request via ftp.

We have a C++ tool to convert from our format to ARC version 1 format (used by the Internet Archive and Heritrix). We are developing one for WARC, now an ISO and International Internet Preservation Consortium (IIPC) standard. County, city, state, and federal crawls through 2008 have been converted to ARC. We are considering converting the entire operation to WARC.
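For reference, an ARC version 1 URL record is essentially a one-line header (URL, IP address, 14-digit archive date, content type, content length) followed by the captured content. The sketch below only illustrates that record layout with made-up values; it is not our converter tool.

    // Minimal sketch of one ARC v1 URL record (illustration only -- NOT the
    // WebBase converter). Assumed layout:
    //   <url> <ip-address> <YYYYMMDDhhmmss> <content-type> <length>\n<content>\n
    #include <iostream>
    #include <string>

    void writeArcV1Record(std::ostream& out,
                          const std::string& url,
                          const std::string& ip,
                          const std::string& archiveDate,   // YYYYMMDDhhmmss
                          const std::string& contentType,
                          const std::string& content) {
        out << url << ' ' << ip << ' ' << archiveDate << ' '
            << contentType << ' ' << content.size() << '\n'
            << content << '\n';
    }

    int main() {
        // Hypothetical page; a real converter would copy the stored HTTP
        // response and crawl metadata out of the WebBase repository.
        writeArcV1Record(std::cout, "http://www.example.com/", "192.0.2.1",
                         "20040603000000", "text/html",
                         "<html><body>hello</body></html>");
        return 0;
    }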



Wibbi:
If you don't want to bother with the client because you will not be building custom handlers, there is now a Web interface to the crawls. There are several custom filters to choose from, such as AND and OR. Wibbi gives slower throughput than our C++ client, even with no filtering. A browser limitation on Windows/Linux (except Opera and Firefox 2.0.0.1+) restricts each download to 4 GB. Since the filters run on our server, you can filter through more data than that without reaching the limit.

If you decide to use the data, please email Gary ( Jeez cs stanford Edu) so we can mention your usage in our funding requests.
We would also appreciate knowing of any papers that come out of your usage.


WebVac spider

WebVac crawls depth first, generally to a depth of 7 levels, and fetches a maximum of 10,000 pages per site.
We only follow links to pages within the same domain. Until 2007 our general policy was to gather a 1.5 TB sample.
Now we crawl a list of sites until the list is done or the month is over, retrying unavailable pages several times.
We pause 1 to (almost always) 12 seconds between pages, depending on IP-address bottlenecks.
For the federal government crawls, we take up to 150,000 pages to 12 levels over
a fairly static group of sites.
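
To make the per-site policy concrete, below is a minimal, self-contained sketch of a depth-first walk with the limits described above (depth 7, 10,000 pages per site, same-site links only, a pause between pages). It traverses a toy in-memory link graph (example.com URLs and the constants are made up for illustration) rather than fetching over HTTP; it illustrates the policy, not WebVac itself.

    // Illustration of the crawl policy only -- NOT WebVac. Links come from a
    // toy in-memory graph; a real crawler would fetch pages and extract links.
    #include <unistd.h>   // sleep()
    #include <iostream>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    typedef std::map<std::string, std::vector<std::string> > LinkGraph;

    const int kMaxDepth        = 7;      // depth limit for general crawls
    const int kMaxPagesPerSite = 10000;  // per-site page cap

    // Very rough "same site" check: does the URL start with the site prefix?
    bool sameSite(const std::string& url, const std::string& site) {
        return url.compare(0, site.size(), site) == 0;
    }

    void crawl(const LinkGraph& graph, const std::string& site,
               const std::string& url, int depth,
               std::set<std::string>& seen, int& fetched) {
        if (depth > kMaxDepth || fetched >= kMaxPagesPerSite) return;
        if (!sameSite(url, site)) return;       // stay within the domain
        if (!seen.insert(url).second) return;   // already visited

        std::cout << "fetch " << url << " (depth " << depth << ")\n";
        ++fetched;
        sleep(1);   // politeness pause; WebVac waits 1-12 s per IP address

        LinkGraph::const_iterator it = graph.find(url);
        if (it == graph.end()) return;
        for (size_t i = 0; i < it->second.size(); ++i)   // depth first
            crawl(graph, site, it->second[i], depth + 1, seen, fetched);
    }

    int main() {
        // Toy link graph standing in for pages fetched over HTTP.
        LinkGraph graph;
        graph["http://www.example.com/"].push_back("http://www.example.com/a");
        graph["http://www.example.com/"].push_back("http://other.example.org/");
        graph["http://www.example.com/a"].push_back("http://www.example.com/b");

        std::set<std::string> seen;
        int fetched = 0;
        crawl(graph, "http://www.example.com/", "http://www.example.com/", 0,
              seen, fetched);
        return 0;
    }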


Architecture

WebBase architecture diagram
Overall system screenshot from 2007


Client Software ( RunHandlers )

  • If you don't want to bother with the client because you will not be building custom handlers, there is now a Web interface, Wibbi, to get pieces of up to 4GB of the 2003-present crawls.
  • These instructions assume Internet access to the machine hosting WebBase data and a CVS checkout of the WebBase code or an ftp get.
  • We allow specification of machine, port, first site and last site for the stream (e.g. www.ibm.com). distribrequestor.pl and getpages.pl also take those arguments. The web page repository is organized by site, so offset means offset within the site.

  • RunHandlers is supported on 32-bit GNU/Linux and Solaris systems with GNU make (gmake), g++ (<= 3.4.0), Perl 5.05+, and W3C's libwww.

    1. Fetch the latest WebBase client source code from ftp://db.stanford.edu/pub/webbase.
    2. Unroll the source code. For example, GNU tar can do this with

       > tar xfz webbase-client-????-??-??.tar.gz

    3. Follow the instructions in the source code's README.client:

       > chdir dli2/src/WebBase/ && more README.client
    Build everything:
    (Use a 32-bit Linux box.)

    Make sure the library path includes W3C's libwww.
    This library must be installed by a system administrator with root privileges.

    Make sure the environment variable WEBBASE points to WebBase:
    setenv WEBBASE [absolute path]/WebBase

    (1) Run GNU make:

       WebBase/> ./configure
       WebBase/> make client

    If you get:
    handlers/extract-hosts.h:27:21: WWWCore.h: No such file or directory

    handlers/extract-hosts.h:28:21: HTParse.h: No such file or directory

         Your include path may be wrong:
    we expect the header at /usr/local/include/w3c-libwww/WWWCore.h,
    so you may need to change this in Makefile.in and configure
    (order MAY matter), then rerun ./configure.

    To use later gcc versions, here's the hack. After running ./configure:

    1. Add -fpermissive to the CPPFLAGS on line 68 of the makefile.
    2. Comment out lines 34 and 35 in hashlookup/hashlookup.h:
           extern unsigned int hashlookup_error;
           extern unsigned int verbose_error;


    (2) Test your build.
         (a) Turn on cat-handler, which simply outputs what it receives.
                In inputs/webbase.conf, set
                CAT_ON = 1
         (b) Try RunHandlers on a local example file:
                bin/RunHandlers inputs/webbase.conf \
               "file:///handlers/example-50-pages"
              [50 sample pages are printed]

    Now try the network version:

    Method 1:
     Run scripts/distribrequestor.pl to start a distributor:
     (either chmod +x scripts/*.pl or invoke it with "perl")
    args: (must be in this order)
    # host
    # port
    # num pages
    # starting web site (optional) e.g. www.ibm.com
    # ending web site (optional)
    # offset in bytes within web site (optional)
     

    [example run:]
    WebBase/scripts> distribrequestor.pl wb1 7008 100
     distrib daemon returned 171.64.75.151 7160
     (use as ../bin/RunHandlers ../inputs/webbase.conf "net://171.64.75.151:7160/?numPages=100" )
    WebBase/scripts>
     Now you can invoke RunHandlers with the above info:
     ( cut and paste it from the echo)
    WebBase/scripts> ../bin/RunHandlers ../inputs/webbase.conf "net://171.64.75.151:7160/?numPages=100"
     will print back 100 sample pages.  All instances of RunHandlers connected to
     the above port share the same pool of pages.  To get an independent
     stream, run distribrequestor.pl to get a new port.
     

    Method 2:
     You can also use our one-step script getpages.pl (no need to specify a first site)
    (either chmod +x scripts/*.pl or invoke it with "perl")

    [example run:]
    args: (must be in this order)
    # num pages
    # host
    # port
    # starting web site (optional) e.g. www.ibm.com
    # ending web site (optional)
    # offset in bytes within web site (optional)


    WebBase/scripts> getpages.pl 2 wb1 7008 www.ibm.com www.ibm.com (only give me www.ibm.com)
     Starting getpages.pl using Perl 5.6.0
     Do you want to run
    /dfs/sole/6/gary/dli2/src/WebBase//bin/RunHandlers /dfs/sole/6/gary/dli2/src/WebBase//inputs/webbase.conf "net://171.64.75.151:7163/?numPages=2" now?(Y/N):
    WebBase/scripts> Y

    To get the full content of each page, set CAT_ON = 1 in inputs/*.conf.

    If you get the ERROR:
    bin/RunHandlers: error while loading shared libraries: libwwwcore.so.0:
    cannot open shared object file: No such file or directory
    your library path is not set right. Fix it by setting the LD_LIBRARY_PATH
    variable in the shell where you're about to run the WebBase client.
    For example, if you found libwwwcore.so in /opt/somewhere/lib/libwwwcore.so,
    then you could tell your system:
    setenv LD_LIBRARY_PATH /opt/somewhere/lib

    Return codes (contact us to report these):

    A blank page means there is no server running on that port.
    If you get a line of just numbers and not much else:
    256 means a distributor is running on a server with no data or a dangling
    softlink.
    32512 is usually a missing softlink on the server
    ( fix is ln -s /u/gary/WebBase.centos/bin/runhandlers /lfs/1/tmp/webbase/runhandlers )
    or missing shared libraries such as libwwwutils.so.0.

    Note on the output:

    This next line is just a separator, so that RunHandlers knows it is getting a new page:
    ==P=>>>>=i===<<<<=T===>=A===<=!Jung[...]  -- page separator
    URL: http://www.powa.org/ -- page URL
    Date: June 3, 2004                -- when crawled
    Position: 695                         -- bytes into the site so far
    DocId: 1                                 -- sequential page id within site
    HTTP/1.1 200 OK                -- response to our http request
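
    If you pipe the cat-handler output into your own code, you can split it into
    pages on the separator lines. The sketch below is an assumed, minimal
    parser: it keys only on the "==P=>" prefix shown above (the full separator
    string is longer), echoes two of the metadata headers, and counts pages;
    the "split_pages" binary name is hypothetical, and you should adapt the
    prefix and fields to what your stream actually contains.

        // Minimal sketch of splitting RunHandlers cat-handler output into pages.
        // Assumption: each page starts with a separator line beginning "==P=>",
        // followed by "URL:", "Date:", "Position:", "DocId:" headers and then
        // the stored HTTP response.
        // Usage (hypothetical name):  bin/RunHandlers ... | ./split_pages
        #include <iostream>
        #include <string>

        int main() {
            const std::string sepPrefix = "==P=>";  // prefix of the separator
            std::string line;
            long pages = 0;

            while (std::getline(std::cin, line)) {
                if (line.compare(0, sepPrefix.size(), sepPrefix) == 0) {
                    ++pages;                        // a new page begins here
                    continue;
                }
                // Echo just the per-page metadata headers; everything else is
                // the HTTP response, which your own code would process.
                if (line.rfind("URL: ", 0) == 0 || line.rfind("DocId: ", 0) == 0)
                    std::cout << line << '\n';
            }
            std::cout << pages << " pages seen\n";
            return 0;
        }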


    Death threat:
    If a distributor is inactive for a while, it may be killed by us so that we can reuse the resources.
    To restart at the same point you must start a new distributor at the offset where it left off
    (+ 1 to prevent getting the previous page again).
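
    For example (hypothetical values, assuming the last page you received from
    www.ibm.com reported Position: 695):

    WebBase/scripts> distribrequestor.pl wb1 7008 100 www.ibm.com www.ibm.com 696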

    Putting out a contract:
    If you are done, you can run distribrelease.pl [remote-host] [host port] [stream port]
    from the same machine you made the request from. We will immediately kill the distributor for you.
    We especially recommend this if you are running
    many requests in one day, so that we do not run out of resources.

    If you specify firstSite/lastSite, please note that you can only use the root
    (e.g. www.ibm.com), not a page within the site (e.g. 01net.com/envoyerArticle/1 ),
    and don't include the http:// part.

    -------------------------------------------------------------------
     

    To create a new webpage stream handler:

    You can use the other handlers in the distribution as templates.
    To add a new handler, add the following to the appropriate places:
     * 1) #include "myhandler.h" into handlers/all_handlers.h
     * 2) handler.push_back(new MyHandler()); into handlers/all_handlers.h
             (following the template of the handlers already there)
     * 3) in Makefile, add entries for your segments to compile
             in the line: HANDLER_OBJS = jhandler.o [...]
     * opt) in Makefile, customize your build if necessary by adding a line
               jhandler_CXXFLAGS = -Iyour-include-dir --your-switches [...]
              (following the template of the handlers already there)

    We also have a one-button script called scripts/addHandler.pl that will
    prompt you for all your pieces and put them in place, without you having
    to do the above file surgery yourself.
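
    For orientation only, here is a self-contained sketch of the shape of a
    handler. The base-class name, method name, and signature below are
    hypothetical stand-ins, not the real WebBase interface; the actual interface
    is defined by the headers in handlers/, so copy an existing handler as your
    real template.

        // Hypothetical handler skeleton -- the actual WebBase interface
        // differs; see handlers/*.h in the distribution for the real base
        // class, method name, and signature.
        #include <iostream>
        #include <string>

        // Assumed stand-in for the real handler base class.
        class Handler {
        public:
            virtual ~Handler() {}
            // Assumed stand-in for the real per-page callback: one page plus
            // its associated metadata per call, in stream order.
            virtual void processPage(const std::string& url,
                                     const std::string& crawlDate,
                                     long docId,
                                     const std::string& body) = 0;
        };

        // Example: a trivial "index builder" that counts pages and bytes.
        class CountHandler : public Handler {
        public:
            CountHandler() : pages_(0), bytes_(0) {}
            virtual void processPage(const std::string& url,
                                     const std::string& crawlDate,
                                     long docId, const std::string& body) {
                ++pages_;
                bytes_ += body.size();
                std::cout << docId << " " << url << " (" << crawlDate << ")\n";
            }
            virtual ~CountHandler() {
                std::cout << pages_ << " pages, " << bytes_ << " bytes total\n";
            }
        private:
            long pages_;
            long bytes_;
        };

        int main() {
            // Stand-in for the feeder loop that RunHandlers normally drives.
            CountHandler h;
            h.processPage("http://www.powa.org/", "June 3, 2004", 1,
                          "<html><body>sample page</body></html>");
            return 0;
        }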
     
     


     

    GLOSSARY


    WebVac - the WebBase web crawler or spider. Used to be called Pita.

    RunHandlers - (formerly "process") an executable that indexes a stream,
                  file, or repository.
                  It consists basically of a feeder and one or more handlers.

    handler - the interface that any index-building piece of code must implement.
              The interface's main (only) method provides a page and associated
              metadata, and the implementor of the method can do whatever they
              want with it.

    feeder -  the interface for receiving a page stream from any kind of source
              (directly from the repository, via Webcat, via network, etc.). The
              key method of the interface is "next", which advances the stream by one
              page. After calling next, various other methods can be used to get the
              associated metadata for the current page in the stream. A feeder can
              also be used to build indexes if the index-building code is written to
              process page streams.

    distributor - a program that disseminates pages to multiple clients
               over the network, supporting session IDs, etc.; a generalization of
               what Distributor.cc in Text-index/ does.

    offset - used in distributor requests to specify how many bytes from the
             beginning of the site to start at.

    DocId - computed within the download. If you download any portion of the crawl,
                   even from the middle, it will begin at 0.  If you download the whole crawl,
                   it will be monotonically increasing from start to end.
