Preserving Web-Based Auction Catalogs

Gretchen Nadasky, IMLS Grant Project, Pratt Institute MSLIS, LIS-698, Dr. Tula Giannini

Weekly Report

Louis Jean Desprez (1743–1804), “Illumination de la croix de Saint-Pierre à Rome”

January 15, 2013

I returned to FARL after the semester break to continue with the IMLS-funded M-LEAD-TWO digitization project.  My task this semester is to identify auction site candidates for the next group of seeds.  Although one of the greatest challenges in web archiving is the sheer number of non-essential pages, it is impossible to quantify these manually on a page-by-page level.  In setting priorities for seed candidates, however, other attributes can help narrow the field.  The initial list of 221 auction houses was pared down to 137.  Data on attributes such as technology, scope, depth of any existing archive, and language was collected for the 147 subscription auction catalogs that have a website and entered into a spreadsheet.

Today I went back to the report I wrote at the end of last semester and to the spreadsheet.  One of the things I will focus on this semester is identifying websites that do not have archives and could be at risk.  My research revealed 20 such sites, and I will be examining them for suitability as seeds for custom crawls.  The spreadsheet also gathered profile information on each website so we can begin to understand whether it makes sense for FARL to spend resources preserving it.
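As a rough illustration of that first pass, a filter like the following could pull the at-risk sites out of the profile spreadsheet; this is only a sketch, and the file name and column headings are invented placeholders rather than the actual spreadsheet's.

```python
# Sketch: pull the at-risk auction-house sites (those with no archive on their
# own website) out of the profile spreadsheet. The file name and the column
# headings below are invented placeholders, not the actual spreadsheet's.
import pandas as pd

profiles = pd.read_excel("auction_house_profiles.xlsx")  # hypothetical file name

# Sites with no self-hosted archive are the ones most at risk of loss.
at_risk = profiles[profiles["Has Archive"].str.strip().str.lower() == "no"]

print(f"{len(at_risk)} of {len(profiles)} sites have no archive of their own")
print(at_risk[["Auction House", "Website", "Language"]].to_string(index=False))
```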

January 22, 2013

I had a meeting with Debbie Kempe and the project consultant Lilly Pregill to determine the goals for the semester.  We would like to accomplish the following:

1)   Re-visit the seed sites and decide which to continue harvesting and whether crawl frequency can be reduced

2)   Review the catalog list and identify online-only catalogs for which print is no longer available

3)   Identify new crawl candidates and specific URLs based on risk assessment, category relevance, availability of PDFs, and a cross-check with the Wayback Machine

4)   Examine Arcade's process of providing a URL that links to auction house archives, to begin determining possible ways of presenting archived digital materials

5)   Prepare a paper and talk for the NYAC panel on web archiving on June 6

6)   Work with the Archives department on social media projects

7)   Seminar classes (10 hours)

8)   Keep a weekly journal of practicum work

9)   Prepare the project presentation (15 hours)

NOTE: I am fully aware that I can’t accomplish this entire list in the 100 hours allotted for the internship. I will probably try anyway, as I can see the potential for creating a professional report from the results.

January 29, 2013

This morning there was a staff meeting for all of FARL.  Each department talked about the initiatives it is undertaking.  The Archives department is continuing with a digitization project called “Gilded Age,” a series of mini-collections from the Frick’s holdings exhibiting turn-of-the-century materials online.  The rest of the day I worked on researching prices for the auction catalogs.  So far the price of a single catalog varies widely, from $26 to $70.  Subscription prices depend on the number of auctions held per year and whether catalogs can be ordered by department.  A full subscription to Bonhams costs $4,100.

February 5, 2013

Working at FARL has its perks.  I spent most of lunch wandering around the collection.  The rest of the day I went to each auction house website to assess the cost of buying a catalog subscription.  Historically, auction houses have sent catalogs to FARL gratis on a regular basis.  In the last several years, however, it has become more difficult to obtain these materials.  Auction houses are now asking for the full cost or the shipping cost, or only sending catalogs by request.  This puts a monetary and time strain on collections and cataloging staff at the Frick.  While the goal of preserving these ephemeral materials for access by future researchers has not changed, the economic climate for the auction business has become more difficult.  This is a challenge for the Frick.

February 12, 2013


Who knew that FARL closed for Lincoln’s Birthday?  As a result I am alone in the office, which in a way is fine by me!  I took the time to listen to the advanced crawl workshop sponsored by Archive-It.  The Archive-It team and the network of people who use it are very supportive.  The people at Internet Archive respond quickly to questions and take suggestions for changes to the interface.

I get tired of hunting down pricing and decide to change focus and analyze crawls.  One of the problems I see is that the sites we are crawling are so large that much of the material is not being captured, because the crawl times out or the material looks out of scope to the crawler.  I will need to go through the URLs and determine what is and is not being captured.  From there I will add limits to the next crawl.
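A sketch of the sort of URL triage this involves follows, assuming the captured and queued URL lists have been exported from the crawl reports to CSV files; the file names and the "url" column are placeholders, not actual Archive-It exports.

```python
# Sketch: compare captured URLs against queued/out-of-scope URLs exported from
# the crawl reports to see which parts of a seed site are being missed. The CSV
# file names and the "url" column are placeholders for a manual report export.
import csv
from collections import Counter
from urllib.parse import urlparse

def path_prefixes(report_csv, url_column="url"):
    """Count first path segments (e.g. /auctions, /departments) in a report."""
    counts = Counter()
    with open(report_csv, newline="") as f:
        for row in csv.DictReader(f):
            path = urlparse(row[url_column]).path.strip("/")
            counts["/" + path.split("/")[0] if path else "/"] += 1
    return counts

captured = path_prefixes("captured_urls.csv")   # hypothetical export
queued = path_prefixes("queued_urls.csv")       # hypothetical export

# Sections that pile up in the queue but rarely show up in the captures are
# candidates for scope limits, or for a bigger time/document budget.
for prefix in sorted(queued, key=queued.get, reverse=True)[:10]:
    print(f"{prefix:30} queued={queued[prefix]:6} captured={captured.get(prefix, 0):6}")
```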

February 19, 2013

Today my goal was to run test crawls on new potential websites to be harvested.  Once you get the hang of it, Archive-It is not difficult to use, but there are many different aspects to analyzing reports and initiating new crawls.  One of the gating factors to a full-fledged web-harvesting program at the Frick is the time-intensity of execution.  Crawls are done monthly and the results must be analyzed using a series of reports.  The reports reveal whether the whole site was crawled and archived.  There are many variables that can prevent a quality result.

Of course, FARL does not want to present an incomplete result to users, so this phase of the process is extremely important.  Once omissions are discovered, the scope of the crawl can be adjusted by either removing specific URLs from inclusion or expanding the scope of the crawl.  These sites are very large, so the process is time-consuming and tedious.  Additionally, test crawls provide a report but do not archive the materials, so you cannot assess auction catalog quality from a test crawl.

February 26, 2013

Today I had a long task list: analyze test crawls, identify new candidates for harvest, and get some ideas for how other Archive-It partners are giving users access to their website archives.

First, I analyzed the results of the test crawls and added several limits.  One problem that I see in the reports is that the time limit for the crawl is exceeded before much of the desired material can be harvested.  The limits also keep the crawler focused; for example, we are no longer archiving things like the comics and arms catalogs that are out of scope for FARL.
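The limits themselves amount to pattern rules applied to each discovered URL. A minimal sketch of the idea, with illustrative patterns and URLs rather than the exact rules entered into Archive-It:

```python
# Sketch: scope limits as pattern rules applied to each discovered URL. The
# patterns below are illustrative, not the exact rules entered into Archive-It.
import re

EXCLUDE = [
    re.compile(r"/comics?(/|$)", re.IGNORECASE),
    re.compile(r"/arms(-and-armou?r)?(/|$)", re.IGNORECASE),
    re.compile(r"/movie-?posters?(/|$)", re.IGNORECASE),
]

def in_scope(url: str) -> bool:
    """A URL is in scope unless it matches one of the exclusion patterns."""
    return not any(pattern.search(url) for pattern in EXCLUDE)

# Hypothetical URLs, just to show the rules firing.
for url in ["http://www.ha.com/comics/lot-123",
            "http://www.ha.com/fine-art/lot-456"]:
    print(url, "->", "keep" if in_scope(url) else "exclude")
```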

Using the analysis that was mostly completed last semester, I focused on the sites that do not offer an archive of their own.  These are the ones most at risk of important materials being lost to future researchers.  However, there are several other considerations.  On a positive note, the new version of the Archive-It software is more compatible with Flash, so catalogs in that format can be viewed.  It remains to be seen, however, how those materials can be preserved once computers no longer support Flash in favor of HTML5.

March 5, 2013

Today’s projects:

  • Continued the analysis of the crawls and determined the kinds of limits that should be placed.  After running a test crawl on Ivey Selkirk, I noted that it is completely blocked by robots.txt.
  • Debbie Kempe invited me to lunch with her and Lilly Pregill at the Asia Society to describe web-archiving challenges.  Debbie and Lilly were going to an update meeting with the Mellon Foundation.  (The meeting went well and the Foundation is very interested in our continuing the program.)

In addition to this project I volunteered to help the Archives Department with a social media project.  Shannon Yule, the associate archivist, wants to create a HistoryPin site with photos from the collection of artists in their studios.  The site will exhibit the images and place them on a map of Paris.  It is my job to take the addresses of the studios from an Excel spreadsheet and add them to the metadata in HistoryPin that will generate the data visualization.  The project can be seen at:

March 12, 2013

[Image: Web Archiving Life Cycle Model diagram]

The Archive-It development team released a White Paper suggesting programs for archiving born-digital materials.  Today, I read the report and outlined the findings.

STEPS:

  1. Vision and Objectives – to preserve materials that are only available in digital formats, as an extension of overall policy development and a supplement to the auction materials that the Frick collects; to make use of freely available web content
  2. Resources and Workflow – Library Director: vision and collections policy.  NYARC consultant: helps develop policy, rights, cooperation, funding.  Intern: tasks (see below)
  3. Library Staff – web archiving is a group effort covering acquisition policy, technical expertise, running crawls, storage, and access
  4. Access/Use/Reuse – all partners use the Wayback Machine interface; some need to restrict access; a future feature is IP-address restrictions for on-site use policies; links to the Wayback Machine; developing portals/landing pages

REQUIRED POLICIES:

  1. Collection Management- choosing websites
  2. Preservation and storage
  3. Risk Management – permissions policy to harvest websites.  Archive-It uses the Oakland Archive Policy, 2002: http://www2.sims.berkeley.edu/research/conferences/aps/removal-policy.html
  4. Description and Metadata – most organizations are not generating metadata for the collections in the application itself; mostly collection-level metadata is prepared

TASKS:

  1. Appraisal and Selection – initial selection of subject, then technical judgment of parameters
  2. Scoping – host constraining (specific domains; beginning to experiment), time limits, single-format crawls such as PDF only, matching policies with what can actually be collected, and crawl frequency, which may vary by site (a sketch of recording these decisions follows below)
  3. Run test crawls – data capture, storage and organization, quality assurance and analysis

Quality assurance takes place at the beginning of the process, after a test crawl, and again to check the results.
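Since these scoping decisions change from crawl to crawl, it helps to write them down per seed. A minimal sketch of one way to record them, using invented field names and values rather than Archive-It's actual settings:

```python
# Sketch: record the scoping decisions for one seed so they can be tracked
# between monthly crawls. Field names and values are illustrative only and do
# not correspond to Archive-It's actual settings.
from dataclasses import dataclass, field

@dataclass
class SeedScope:
    seed_url: str
    crawl_frequency: str = "monthly"          # may vary by site
    time_limit_hours: int = 72                # stop runaway crawls
    document_limit: int = 100_000             # cap on URLs per crawl
    host_constraints: list = field(default_factory=list)   # domains to stay within
    format_filters: list = field(default_factory=list)     # e.g. ["application/pdf"]
    exclusions: list = field(default_factory=list)         # regex fragments to skip

bonhams = SeedScope(
    seed_url="http://www.bonhams.com/",
    host_constraints=["bonhams.com"],
    exclusions=[r"/login/", r"/results/\?page="],   # hypothetical exclusions
)
print(bonhams)
```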

Continuing to analyze the crawl report.  Some of the constraints I added improved the results.  In one case I checked off the wrong box and it caused the site not to be saved properly; this is one of the risks of changing the directions on the monthly harvest.  It is hard to run test crawls complete enough to incorporate every aspect of the crawl.  Another challenge is keeping track of what has been done and the subsequent results.

For Bonhams:

  • Some catalogs are being captured.
  • Things like sales and login links end up in the queue, e.g. https://www.bonhams.com/login/?next=/auctions/19830/lot/169/__hash__salesdisplay_flaglot_toggle
  • Some URLs look like they are stuck in the queue (e.g. http://www.bonhams.com/results/?page=1), but the content is actually being harvested: http://wayback.archive-it.org/2135/20130303170917/http://www.bonhams.com/auctions/20631/
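One way to spot-check this without clicking through by hand is to request a URL through the collection's Wayback endpoint and see whether an archived copy resolves. A rough sketch, assuming the endpoint pattern shown in the capture link above and that omitting the timestamp resolves to the most recent capture; that behavior is an assumption, not a documented guarantee.

```python
# Sketch: spot-check whether a URL sitting in the crawl queue has in fact been
# archived, by requesting it through the collection's Wayback endpoint. Assumes
# the URL pattern shown above for collection 2135, and that omitting the
# timestamp resolves to the most recent capture (an assumption, not a
# documented guarantee).
import requests

COLLECTION = "2135"

def has_capture(url: str) -> bool:
    wayback_url = f"http://wayback.archive-it.org/{COLLECTION}/{url}"
    response = requests.get(wayback_url, timeout=30)
    return response.status_code == 200

print(has_capture("http://www.bonhams.com/results/?page=1"))
```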

I placed a lot of restrictions on Heritage Auctions (ha.com) to avoid archiving comics and movie posters.  That worked well: only 2 URLs were harvested, and all of the fine art can still be searched.  The home page doesn’t come up now, though.  I must fix that.
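A quick regression check of the exclusion rules against URLs that must stay in scope would catch this kind of thing before the next monthly crawl. A small sketch, with made-up patterns and URLs standing in for the actual restrictions:

```python
# Sketch: sanity-check the exclusion patterns against URLs that must stay in
# scope (the home page, the fine art sections) before the next monthly crawl.
# Patterns and URLs are made up to illustrate the check, not the actual rules.
import re

exclusions = [
    re.compile(r"comics", re.IGNORECASE),
    re.compile(r"movie.?posters", re.IGNORECASE),
    re.compile(r"arms", re.IGNORECASE),
]

must_keep = [
    "http://www.ha.com/",            # home page -- this must not be blocked
    "http://fineart.ha.com/",        # hypothetical fine art section
]

for url in must_keep:
    hits = [p.pattern for p in exclusions if p.search(url)]
    print(url, "->", "BLOCKED by " + ", ".join(hits) if hits else "ok")
```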

March 19, 2013

Today my task was to review the last crawl following the adjustments made to the monthly crawl.

Seeds whose parameters I changed for the next monthly crawl to get better results (waiting for the report):

  • Heritage Auctions
  • Auction.de
  • Hosane
  • RW Oliver
  • Dreden-kunstauktion.de

Underperforming crawls:

  • DNFA – robots.txt blocking images; deactivated for the April 3 crawl
  • Bonhams
  • International Auctioneers – deactivated for the April 3 crawl
  • Koller Auctions – deactivated for the April 3 crawl
  • Pandolfini – deactivated for the April 3 crawl

Working fine:

  • Tajan

March 26, 2013

Attended the Metadata for Digitization workshop at METRO.  The IMLS grant sponsors one workshop, so I decided to take this two-day session led by Dr. Marcia Zhang, a leading metadata expert and author.

April 3, 2013

My task list today included 1) entering new websites into Archive-It to be crawled in April’s harvest, 2) putting together a list of UK-based auction catalogs for the Library Director, Stephen Bury, 3) meeting with Suz Massen to discuss usage of auction catalogs in the FARL reading room, and 4) meeting with Rodika Krauss about how archived catalogs can be included in the library catalog.

April 10, 2013

I spent most of the day analyzing the Scope-It report for the April 4 crawl.

The websites that were added:

  • http://www.kunstauktionshaus-schlosser.de/ – recommended by Debbie and only crawled 2X by IA
  • http://auctions.lawsons.com.au/
  • http://shapiro.com.au/
  • http://www.de-vuyst.com/nl
  • http://www.swanngalleries.com/index.cgi
  • http://www.illustrationhouse.com/

CRAWLED:  http://www.kunstauktionshaus-schlosser.de/

CRAWLED:  http://auctions.lawsons.com.au/

1)   Black holes in the queue:

Forms to email for information about a lot number: http://www.lawsons.com.au/asp/email_lot.asp?salelot=7417++++++18+&refno=30489456

Log-in to bid: http://www.lawsons.com.au/asp/bidnow.asp?salelot=7925A+++1110+&refno=31013466

2)   The auction house does not offer the full range of catalogs in its archive.  We would have to crawl often to capture content, and not all of the content is in scope.

CRAWLED:  http://shapiro.com.au/

1)   Catalog is in HTML and loads in archive.  Can click through to get details

2)   Nothing in queue.  No robots.txt

3)   Site is small (50,000 URLs)

CRAWLED:  http://www.de-vuyst.com/nl

1)   Only most recent catalog is offered on live site.  Was captured by crawl.

2)   Can click through for details

3)   English version of the website was not captured.  Could change scope?

CRAWLED:  http://www.swanngalleries.com/index.cgi

1)   Catalogs are out of scope

CRAWLED:  http://www.illustrationhouse.com/

April 17, 2013

I am still observing a variety of issues with the results of the web crawls.  In order to prioritize sites for a comprehensive preservation program, I will have to analyze the websites one by one.  The problem with running test crawls, however, is that Archive-It gives you a report about a test crawl but not the actual archive.  For our purposes, where we don’t just want to archive webpages but also the documents within them, we need to see what objects are actually being captured.

It is not perfect, but I decide to put the URL for each website into the Wayback Machine as a proxy for a crawl.  This has some disadvantages.  First, the Wayback Machine crawls a given site at irregular intervals, so the results are essentially random.  Second, it is not clear to what extent a generic crawl captures a site; setting crawl parameters is more effective for capturing everything, so looking at the Wayback Machine results won’t replicate a customized crawl.  On the other hand, by using the Wayback Machine I feel I can get a sense of the extreme cases: the sites that archive well and those that cannot be captured at all.  The sites that land in a middle category are likely those that would require extra effort to set crawl parameters.

I decide to collect data on how many times the Wayback Machine has captured each site and look at the results.  I plan to record any issues that arise in getting the catalogs from the Wayback Machine crawl, to attempt to quantify the possibilities and problems of web archiving overall.
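The capture counts can be pulled from the Wayback Machine's public CDX endpoint. A minimal sketch of that query follows; the endpoint and parameters are real, but the site list is just a placeholder for the candidates in the spreadsheet.

```python
# Sketch: count how many days the public Wayback Machine has captured each
# candidate site, using the CDX endpoint at web.archive.org. The site list is
# a placeholder for the candidates in the spreadsheet.
import requests

sites = ["shapiro.com.au", "auctions.lawsons.com.au", "www.de-vuyst.com"]

for site in sites:
    response = requests.get(
        "http://web.archive.org/cdx/search/cdx",
        params={"url": site, "output": "json", "collapse": "timestamp:8"},
        timeout=60,
    )
    rows = response.json() if response.text.strip() else []
    capture_days = max(len(rows) - 1, 0)  # first row of the JSON output is a header
    print(f"{site:30} {capture_days} capture days")
```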


April 25-29, 2013

Debbie Kempe arranged for me to travel to Pasadena for the ARLIS/NA Annual Conference.  I was very grateful to have the opportunity to go and to hear top art librarians talk about their projects.  I had a great time, and it was there that I learned about the FAB project and saw a presentation about data visualizations using the Getty Provenance Index.

