Gretchen Nadasky
IMLS Grant Project
Pratt Institute MSLIS, LIS-698, Dr. Tula Giannini
I returned to FARL after the semester break to continue with the IMLS-funded M-LEAD-TWO digitization project. My task this semester is to identify auction site candidates for the next group of seeds. Although one of the greatest challenges for web archiving is the sheer number of non-essential pages, it is impossible to quantify these manually on an individual level. In setting priorities for seed candidates, however, other attributes can help narrow the field. The initial list of 221 auction houses was pared down to 137. Data on technology, scope, depth of existing archive, and language was collected for the 147 subscription auction catalogs that have a website and entered into a spreadsheet.
Today I went back to the report I wrote at the end of the semester and to the spreadsheet. One of my focuses this semester will be identifying websites that don’t have archives and could be at risk. My research revealed 20 such sites. I will be examining them for suitability as seeds for custom crawls. The spreadsheet also gathers profile information on each website to begin to understand whether it makes sense for FARL to use resources to preserve them.
I had a meeting with Debbie Kempe and the project consultant Lilly Pregill to determine the goals for the semester. We would like to accomplish the following:
1) Re-visit the seed sites and decide which to continue harvesting and whether crawl frequency can be reduced
2) Review catalog list and identify “on-line” only catalogs where print is no longer available
3) Identify new crawl candidates and specific URLs based on risk assessment, category relevance, availability of PDFs, and a cross-check with the Wayback Machine
4) Examine Arcade’s process of providing a URL link to auction house archives, to begin to determine possible ways of presenting archived digital materials
5) Prepare a paper and talk for the NYAC panel addressing web archiving on June 6
6) Work with Archives department on Social Media Projects.
7) Seminar classes (10 hours)
8) Keep weekly Journal of Practicum Work.
9) Preparation of project presentation (15 hours)
NOTE: I am fully aware that I can’t accomplish this entire list in the 100 hours allotted for the internship. I will probably do it anyway, as I can see the potential for creating a professional report from the results.
This morning there was a staff meeting for all of FARL. Each department talked about the initiatives it is undertaking. The Archives department is continuing with a digitization project called “Gilded Age,” a series of mini-collections from the Frick’s holdings exhibiting turn-of-the-century materials on-line. The rest of the day I worked on researching prices for the auction catalogs. So far the price of a single catalog varies widely, from $26 to $70. Subscription prices depend on the number of auctions held per year and on whether the catalog can be ordered by department. A full subscription to Bonhams costs $4,100.00.
Working at FARL has its perks: I spent most of lunch wandering around the collection. The rest of the day I went to each auction house website to assess the cost of buying a catalog subscription. Historically, auction houses have sent catalogs to FARL gratis on a regular basis. In the last several years, however, it has become more difficult to obtain these materials. Auction houses are now asking for the full cost or the shipping cost, or only sending catalogs on request. This is a monetary and time strain on the collections and cataloging staff at the Frick. While the goal of preserving these ephemeral materials for access by future researchers has not changed, the economic climate for the auction business has become more difficult. This is a challenge for the Frick.
Who knew that FARL closed for Lincoln’s Birthday? As a result I am alone in the office, which in a way is fine by me! I took the time to listen to the Advanced Crawl workshop sponsored by Archive-It. The Archive-It team and its network of users are very supportive. The people at Internet Archive respond quickly to questions and take suggestions for changes to the interface.
I get tired of hunting down pricing and decide to change focus and analyze crawls. One problem I see is that the sites we are crawling are so large that much of the material is not being captured, because the crawl times out or the material looks out of scope to the crawler. I will need to go through the URLs and determine what is and isn’t being captured. From there I will add limits to the next crawl.
Today my goal was to run test crawls on new potential websites to be harvested. Once you get the hang of Archive-It it is not difficult to use but there are many different aspects of analyzing reports and initiating new crawls. One of the gating factors to a full-fledged web-harvesting program at the Frick is the time-intensity of execution. Crawls are done monthly and the results must be analyzed using a series of reports. The reports reveal if the whole site was crawled and archived. There are many variables that prevent a quality result.
Of course, FARL does not want to present an incomplete result to users, so this phase of the process is extremely important. Once omissions are discovered, the scope of the crawl can be adjusted by either removing specific URLs from inclusion or by expanding the scope of the crawl. These sites are very large, so the process is time-consuming and tedious. Additionally, test crawls produce a report but do not archive the materials, so you can’t assess auction catalog quality from a test crawl.
Today I had a long task list: analyze test crawls, identify new candidates for harvest, and get some ideas for how other Archive-It partners are giving users access to their website archives.
First, I analyzed the results of the test crawls and added several limits. One problem I see in the reports is that the time limit for the crawl is exceeded before much of the desired material can be harvested. The limits help with this; for example, we are no longer archiving things like the comics and arms catalogs that are out of scope for FARL.
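The kind of scope limit described above can be thought of as a set of block rules applied to every URL the crawler discovers. The sketch below is an illustration only, not Archive-It's actual implementation; the patterns and URLs are invented examples modeled on the out-of-scope departments mentioned (comics, arms).

```python
import re

# Hypothetical block rules, modeled on the kinds of regular-expression
# limits Archive-It supports, so crawl time is not spent on
# out-of-scope departments (comics, arms, etc.). Patterns are examples.
BLOCK_PATTERNS = [
    re.compile(r"/comics?/", re.IGNORECASE),
    re.compile(r"/arms/", re.IGNORECASE),
]

def in_scope(url: str) -> bool:
    """Return False for any URL that matches a block pattern."""
    return not any(p.search(url) for p in BLOCK_PATTERNS)

# Invented example queue: only the fine-art URL survives filtering.
queued = [
    "http://www.ha.com/fine-art/lot-1234",
    "http://www.ha.com/comics/lot-5678",
]
kept = [u for u in queued if in_scope(u)]
```

Filtering this way before fetching is what keeps the time budget focused on in-scope catalogs.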
Using the analysis that was mostly completed last semester, I focus on the sites that do not offer an archive on their own site. These are the ones most at risk of important materials being lost to future researchers. However, there are several other considerations. On a positive note, the new version of the Archive-It software is more compatible with Flash, so catalogs in that format can be viewed. It remains to be seen, however, how those materials can be preserved once computers no longer support Flash technology in favor of HTML5.
In addition to this project I volunteered to help the Archives Department with a social media project. Shannon Yule, the associate archivist, wants to create a HistoryPin site with photos from the collection of artists in their studios. The site will exhibit the images and place them on a map of Paris. It is my job to take the addresses of the studios from an Excel spreadsheet and add them to the metadata in HistoryPin, which will generate the data visualization. The project can be seen at:
The Archive-It development team released a White Paper suggesting programs for archiving born-digital materials. Today, I read the report and outlined the findings.
This step takes place at the beginning of the process, after a test crawl, to check results.
Continuing to analyze the crawl report. Some of the constraints I added improved the results. In one case I checked off the wrong box, and it caused the site not to be saved properly. This is one of the risks of changing the directions on the monthly harvest. It is hard to run test crawls complete enough to incorporate all the aspects of the crawl. Another challenge is keeping track of what is being done and the subsequent results.
Notes on the crawl report:
- Some catalogs are being captured
- Things like sales links end up in the queue
- Although some information looks like it is stuck in the queue (for example, the URL http://www.bonhams.com/results/?page=1), it is actually being harvested
Placed a lot of restrictions on Heritage Auctions (ha.com) to avoid archiving comics and movie posters. That worked well: only 2 out-of-scope URLs were harvested, and all the fine art can still be searched. The home page doesn’t come up now, though. Must fix that.
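A likely cause of a home page disappearing after restrictions are added is a block rule written too broadly, so that it catches the seed itself. The sketch below is purely hypothetical (the pattern shown is not the constraint actually used on ha.com); it just illustrates how an overly greedy rule can exclude the home page along with the intended departments.

```python
import re

# Hypothetical, overly broad block rule: it was meant to exclude
# everything except the fine-art department, but it also matches the
# bare home page URL, so the seed itself is never captured.
too_broad = re.compile(r"^http://www\.ha\.com/(?!fine-art)")

urls = [
    "http://www.ha.com/",           # home page: unintentionally blocked
    "http://www.ha.com/fine-art/",  # kept, as intended
    "http://www.ha.com/comics/",    # blocked, as intended
]
kept = [u for u in urls if not too_broad.match(u)]
```

Checking each seed URL against the rule set before running the monthly crawl would catch this kind of mistake early.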
Today my task was to review the last crawl following the adjustments made to the monthly crawl.
Seeds whose parameters I changed for the next monthly crawl to get better results (waiting for report):
DNFA – robots.txt blocking images; deactivated for April 3 crawl
International Auctioneers – deactivated for April 3 crawl
Koller Auctions – deactivated for April 3 crawl
Pandolfini – deactivated for April 3 crawl
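The DNFA problem above is the site's robots.txt disallowing the image paths the crawler needs. Python's standard library can check this kind of rule directly; the robots.txt content and paths below are invented for illustration (a real check would fetch the site's actual robots.txt).

```python
from urllib.robotparser import RobotFileParser

# Invented example robots.txt. A rule like this, disallowing an images
# directory, is the sort of thing that keeps catalog images out of a crawl.
robots_txt = """\
User-agent: *
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Catalog pages may be fetched; the blocked image paths may not.
can_fetch_page = parser.can_fetch("archive.org_bot", "/catalog/sale-101.html")
can_fetch_image = parser.can_fetch("archive.org_bot", "/images/lot-1.jpg")
```

Running a check like this against each at-risk seed would show in advance which ones will lose images to robots exclusions.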
Attended the Metadata for Digitization Workshop at METRO. The IMLS grant sponsors one workshop, so I decided to take this two-day session led by Dr. Marcia Zhang, a leading metadata expert and author.
April 3, 2013
My task list today included: 1) entering new websites into Archive-It to be crawled in April’s harvest; 2) putting together a list of UK-based auction catalogs for the Library Director, Stephen Bury; 3) meeting with Suz Massen to discuss usage of auction catalogs in the FARL reading room; 4) meeting with Rodika Krauss about how archived catalogs can be included in the library catalog.
I spent most of the day analyzing the Scope-It report for the April 4 crawl.
The websites that were added:
CRAWLED: http://www.kunstauktionshaus-schlosser.de/ -recommended by Debbie and only crawled 2X by IA
1) Black holes in the queue:
Forms to email for info about lot number: http://www.lawsons.com.au/asp/email_lot.asp?salelot=7417++++++18+&refno=30489456
Log-in to bid:
2) The auction house does not offer its full range of catalogs in its archive. We would have to crawl often to capture content. Not all content is in scope.
1) Catalog is in HTML and loads in archive. Can click through to get details
2) Nothing in queue. No robots.txt
3) Site is small (50,000 URLs)
1) Only most recent catalog is offered on live site. Was captured by crawl.
2) Can click through for details
3) English version of the website was not captured. Could change scope?
1) Catalogs are out of scope
I am still observing a variety of issues with the results of the web crawls. In order to prioritize sites for a comprehensive preservation program I will have to analyze the websites one by one. The problem with running test crawls, however, is that Archive-It gives you a report about a test crawl but not the actual archive. For our purposes, where we don’t just want to archive webpages but also documents within the webpages, we need to see which objects are actually being captured.
It is not perfect, but I decide to put the URL for each website into the Wayback Machine as a proxy for a crawl. This has some disadvantages. First, the Wayback Machine crawls a given site at irregular intervals, so the results are essentially random. Second, it is not clear to what extent a generic crawl captures a site. Setting crawl parameters is more effective for capturing everything on a site, so looking at Wayback Machine results won’t replicate a customized crawl. On the other hand, the Wayback Machine gives me a sense of the extreme cases: the sites that archive well and those that cannot be captured at all. The sites that land in a middle category are likely those that would require extra effort to set crawl parameters.
I decide to collect data on how many times the Wayback Machine has captured each site and look at the results. I plan to record any issues that arise in getting the catalogs from the Wayback Machine crawls, to attempt to quantify the possibilities and problems of web archiving overall.
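Capture counts like these can be pulled programmatically from the Wayback Machine's CDX API (web.archive.org/cdx/search/cdx). The sketch below counts total and successful captures from the API's JSON output format; to stay self-contained it parses an invented sample response rather than making a live request.

```python
import json

# Sample of the Wayback CDX API's JSON output (output=json): the first
# row is a header, each later row is one capture. These rows are
# invented; a real query would look like
#   http://web.archive.org/cdx/search/cdx?url=bonhams.com&output=json
sample_response = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,bonhams)/", "20120115000000", "http://www.bonhams.com/", "text/html", "200", "AAA", "5123"],
    ["com,bonhams)/", "20120720000000", "http://www.bonhams.com/", "text/html", "200", "BBB", "5301"],
    ["com,bonhams)/", "20130102000000", "http://www.bonhams.com/", "text/html", "404", "CCC", "412"],
])

def capture_stats(cdx_json: str):
    """Return (total captures, captures with HTTP status 200)."""
    rows = json.loads(cdx_json)[1:]  # skip the header row
    total = len(rows)
    ok = sum(1 for row in rows if row[4] == "200")
    return total, ok

total, ok = capture_stats(sample_response)
```

Tabulating these two numbers per site would separate the well-archived sites from those the Wayback Machine has barely touched.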
Debbie Kempe arranged for me to travel to Pasadena for the ARLIS/NA Annual Conference. I was very grateful for the opportunity to go and hear the top art librarians talk about their projects. I had a great time, and it was there that I learned about the FAB project and saw a presentation about data visualizations using the Getty Provenance Index.