[nfais-l] NFAIS Enotes, 2013, Number 3, Public Data, Public Access, Public Discussion

jilloneill at nfais.org
Wed Jul 24 14:42:41 EDT 2013


NFAIS Enotes, 2013  Number 3
 
Public Data, Public Access, Public Discussion
Written and Compiled by Jill O’Neill
 
At the end of February 2013, the Office of Science and Technology Policy (OSTP), part of the Executive Office of the President, issued a memorandum to the heads of executive departments and agencies pertaining to improving and increasing access to federally-funded scientific research, both in the form of published articles and in digital scientific data. Those agencies with more than $100 million in annual conduct of research and development expenditures were instructed to develop strategies and approaches for long-term preservation as well as the means for searching, retrieving, and analyzing documented results from the Federal research investment. In particular, on page 4 of the memorandum, the OSTP suggests that “Repositories could be maintained by the Federal agency funding the research, through an arrangement with other Federal agencies, or through other parties working in partnership with the agency including, but not limited to, scholarly and professional associations, publishers, and libraries.”  The full text of the OSTP Memorandum (also known as the Holdren memo) may be found at: http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf.
 
David Crotty of Oxford University Press voiced his first impression of the February 22 memo on the Scholarly Kitchen blog, saying that (in his view) the memo was fair and even-handed, even an enhancement to the work of both Open Access advocates and the publishing community (see http://scholarlykitchen.sspnet.org/2013/02/25/expanding-public-access-to-the-results-of-federally-funded-research-first-impressions-on-the-us-governments-policy/).
 
Following up in May, the National Academy of Sciences sponsored two one-day events to hear commentary from the various communities with regard to the OSTP memorandum: one day was devoted to concerns surrounding the impact of the policy on publications, and the second to concerns surrounding scientific research data. Coverage by Information Today gives a sense of the top concerns -- publisher embargoes, infrastructure/interoperability, and cost considerations (see: http://newsbreaks.infotoday.com/NewsBreaks/Dialogue-Over-Public-Access-to-Scholarly-Publications-Continues-in-the-US-89803.asp).  A more extensive, in-depth write-up of the events by science librarian and Ph.D. candidate Shannon Bole may be found at http://www.scilogs.com/scientific_and_medical_libraries/what-is-e-science-and-how-should-it-be-managed/.
 
In the wake of those public expressions of concern, various potential partners for Federal agencies (that is, associations, publishers, universities, and libraries) brought forward their proposals for compliance with the OSTP memorandum. The first was CHORUS (Clearing House for the Open Research of the United States), proposed by the Association of American Publishers. Essentially, CHORUS would aggregate the metadata associated with federally-funded research onto a searchable platform. The searcher would then be redirected to the appropriate publisher platform, where the version of record of the published research could be accessed. Rather than suggesting that agencies build new repositories on the order of PubMed Central, compliance with the OSTP memorandum could be accomplished through a decentralized approach that is very nearly fully operational. The CHORUS approach is heavily dependent upon publisher participation in CrossRef (http://www.crossref.org) and involvement with several of its initiatives, such as FundRef (development of metadata that ties agency funding to specific research projects) and Prospect (development of standardized APIs and data representation for purposes of facilitating text- and data-mining). For more detail on just how this is envisioned, see AAP statements at http://www.publishers.org/press/107/, with updates at http://www.publishers.org/press/110/. A presentation to attendees of the April United Kingdom Serials Group (UKSG) meeting by Frederic Dylla, Executive Director and CEO of the American Institute of Physics (AIP), provides a great deal of background information about the development of CHORUS and is well worth viewing (see http://www.aip.org/aip/evolving_view.pdf).
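The CHORUS flow described above -- aggregate funding metadata, then redirect the reader to the version of record via its DOI -- can be sketched in a few lines. This is a minimal illustration only: the record fields, sample data, and function names below are invented for this sketch and are not CrossRef's or CHORUS's actual schema; the one real-world convention used is that a DOI resolves to the publisher's site via the standard https://doi.org/ proxy.

```python
# Hypothetical sketch of the CHORUS-style flow: filter FundRef-style
# metadata records by funding agency, then hand the reader a DOI link
# that resolves to the version of record on the publisher's platform.
# Record fields and sample data are illustrative, not a real schema.

SAMPLE_RECORDS = [
    {"doi": "10.1000/example.001", "funder": "Department of Energy",
     "award": "DE-0001", "title": "An example article"},
    {"doi": "10.1000/example.002", "funder": "National Science Foundation",
     "award": "NSF-0002", "title": "Another example article"},
]

def records_for_funder(records, funder_name):
    """Return the metadata records tied to a given funding agency."""
    return [r for r in records if r["funder"] == funder_name]

def doi_link(record):
    """Build the DOI resolver URL that redirects to the publisher's site."""
    return "https://doi.org/" + record["doi"]

for r in records_for_funder(SAMPLE_RECORDS, "Department of Energy"):
    print(r["title"], "->", doi_link(r))
```

The point of the sketch is the division of labor: the aggregator holds only thin funding metadata, while the full text stays on (and traffic flows to) the publisher platform.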
 
Also worth noting with regard to the CHORUS approach (which has been presented as being largely in place) is that one aspect not yet ready to go has to do with text and data mining. The CHORUS proposal references CrossRef’s new text and data mining project, Prospect, but that system is only in its pilot phase and is neither broadly known nor tested. The Dylla presentation above, as well as a May presentation by Ed Pentz, Executive Director of CrossRef, to the STM Association, provides more background about the project (see http://river-valley.tv/prospect-a-text-mining-initiative-from-crossref/).
 
At least one agency, the Department of Energy, through its Office of Scientific and Technical Information (OSTI), has a tool that can operate in conjunction with CHORUS: the Public Access Gateway for Energy and Science (PAGES). It is a working prototype that redirects traffic to distributed collections in much the fashion the CHORUS proposal advocates. A presentation given by Dr. Walter Warnick, the Director of OSTI, fills in the gaps of the program (see http://www.cendi.gov/presentations/01_09_13_DOE.pdf).
 
The second proposal, SHARE (SHared Access Research Ecosystem), came from the Association of Research Libraries (ARL), the Association of American Universities (AAU), and the Association of Public and Land-grant Universities (APLU). Building on the existence of institutional repositories at many research universities, this proposal offers agencies the alternative of using those repositories for depositing and ensuring access to published research and data sets. This negates any need for a Federal agency to build its own repository (a bit of a non-starter as an option anyway, since the memorandum had already specified that any plan or approach adopted by an agency would have to fit within existing budgeted resources). SHARE depends heavily on the adoption by agencies of a standardized set of metadata fields for purposes of successful ingestion and subsequent user discovery. An article in Library Journal does a good job of explaining the ins and outs (see: http://lj.libraryjournal.com/2013/06/oa/arl-launches-library-led-solution-to-federal-open-access-requirements/), but the primary document outlining SHARE may be found at: http://www.arl.org/storage/documents/publications/share-proposal-07june13.pdf. The biggest challenges to implementation of the SHARE approach are the uneven status of existing academic repositories, the uncertainty of funding at the state level for the parent institutions housing those repositories, and the time lag needed to get the SHARE scheme up and running across the network of institutions.
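The standardized metadata that SHARE depends on implies a simple ingestion-time check: a deposit must carry every required field before a repository accepts it. The sketch below illustrates that idea only; the field names are invented for this example, and the actual SHARE proposal does not fix this exact list.

```python
# Hypothetical sketch of the metadata check implied by SHARE's
# standardized field set: before ingesting a deposit, report which
# required fields it lacks. Field names are invented for illustration.

REQUIRED_FIELDS = {"title", "authors", "funding_agency", "award_number",
                   "publication_date", "doi"}

def missing_fields(record):
    """Return the required fields a deposit record lacks (empty set if none)."""
    return REQUIRED_FIELDS - set(record)

deposit = {
    "title": "An example article",
    "authors": ["A. Researcher"],
    "funding_agency": "Department of Energy",
    "doi": "10.1000/example.001",
}

print(sorted(missing_fields(deposit)))  # -> ['award_number', 'publication_date']
```

A shared check of this kind is what makes cross-repository discovery workable: if every participating institution enforces the same field set, a harvester can rely on those fields being present.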
 
With regard to data- and text-mining, the SHARE proposal recognizes, much as CHORUS does, that this is a licensing issue. The SHARE solution notes that “copyright licenses need to be granted on a non-exclusive basis” to both agencies and universities.  As a follow-up note, the SHARE proposal says on page 9, “academic research programs are rapidly developing strategies centered on the challenges of big data and correspondingly the development of data science or data analytics. The corpus of digital repository content, both full text articles as well as the associated data sets, will provide a rich resource for these research programs to experiment with, test and develop new methods to extract meaning and relationships from the repositories.”
 
The Library Journal article mentioned above was careful to note that CHORUS and SHARE are not mutually exclusive. Properly engineered in terms of interoperability, a hybrid of the two proposals might easily relieve the burden of compliance that Federal agencies face in meeting their obligation, sharing it across the various constituencies most likely to benefit from a collaborative agreement.
 
The May 2013 meeting of the Confederation of Open Access Repositories (COAR) membership, largely European in make-up, featured a talk by Tony Hey of Microsoft Research that seemed to suggest that the SHARE proposal, or something like it, would be a logical open access approach. Hey stated that major university research libraries should develop a federated repository system, although he noted that there were challenges associated with federated search and the “Invisible Web.” Hey said that this was the only way libraries might avoid the disintermediation already in progress due to the accomplishments of search giants Google, Yahoo, and Microsoft in enabling discovery of new research from the researcher’s desktop (see: http://www.coar-repositories.org/files/TonyHey-COAR-Talk.pdf).
 
But both the public and private sectors are still faced with those three major pain points -- the potential time lag in providing public access to the final version of record (embargoes), the interoperability of the various systems as the user/searcher gets transferred across platforms in pursuit of relevant content, and the costs of ensuring long-term sustainability. Documentation of challenges beyond those pain points has appeared in various forms over the course of the past twelve to eighteen months. For example, gaps in preparedness in archiving, preserving, and curating data (insofar as both researcher and librarian are concerned) were noted in the August 2012 CLIR publication, The Problem of Data (see: http://www.clir.org/pubs/reports/pub154/pub154.pdf). The report notes that the lack of expert training for information professionals in this specific arena is a very serious concern.
 
The fear that publishers have of the potential impact of government repositories of published material -- such as PubMed Central -- has been captured in research done by Phil Davis, independent consultant and contributor to the SSP Scholarly Kitchen blog. Davis documented a loss in traffic to publishers’ sites as searchers were diverted to full text available in the PubMed Central repository (see http://scholarlykitchen.sspnet.org/2013/04/04/pubmed-central-reduces-publisher-traffic-study-shows/).  As Davis notes there, “While PMC may be providing complementary access to readers traditionally underserved by scientific journals, the loss of article readership from the journal website may weaken the ability of the journal to build communities of interest around research papers, impede the communication of news and events to scientific society members and journal readers, and reduce the perceived value of the journal to institutional subscribers.”  Elsewhere, Davis also challenged the findings of a UK study of institutional repositories that had been used as evidence that repositories posed no challenge to the profitability of commercial publishers (see: http://scholarlykitchen.sspnet.org/2013/04/24/peer-repository-study-recast/).
 
Hence the interest in ensuring that embargoes are in place to protect the publisher’s investment before a deposit to a repository becomes widely accessible. The White House directive mentions a twelve-month post-publication embargo period, but only as a guideline, on the grounds that needs for such access may vary across disciplines. Those with an interest can use the query “article embargo institutional repositories” in their favorite Web search tool to see the variety of publisher stances on such time frames, as well as the instructional materials provided by research libraries regarding content embargoes.
 
While the White House memo has sparked this conversation, it is certainly not just a U.S. problem. Globally, the various stakeholders in the information community have been thinking about how implementation of repositories might work. Both the best practices and the challenges of establishing and populating institutional repositories for use across a globally-networked environment have been documented by the previously mentioned COAR organization.  COAR released its report on the current state of open access repository interoperability in October 2012, and followed up in late June 2013 with a report on sustainable practices for populating such repositories. A third report from COAR, on the future directions of institutional repositories, is due to appear later this year. For more information, consult the following Web pages:
1. http://www.coar-repositories.org/activities/repository-interoperability/coar-interoperability-project/the-current-state-of-open-access-repository-interoperability-2012/
2. http://www.coar-repositories.org/activities/repository-content/sustainable-practices-for-populating-repositories-report/
3. http://www.coar-repositories.org/activities/repository-interoperability/coar-interoperability-project/coar-interoperability-roadmap/
 
But even with those issues, the continuing emergence of international data repositories such as Zenodo (http://www.zenodo.org) stands as testament to the idea that sustainable repositories are the solution of choice.
 
The final question mark in this discussion is cost management. In an April 2012 report, Lasting Impact: Sustainability of Disciplinary Repositories, Ricky Erway, Senior Program Officer at OCLC Research, offered a list of funding options for academic libraries looking into the idea of building a repository. Her list (see page 16) included the following:
● institutional support
● use based institutional contributions
● support via consortium dues
● distributed network of volunteers
● federal government funding
● decentralized arrangement
● commercial “freemium” service (basic access is free; value-added services for fee)
 
Lasting Impact: Sustainability of Disciplinary Repositories may be downloaded from: http://www.oclc.org/content/dam/research/publications/library/2012/2012-03.pdf
 
A case study that appeared in the ASIS&T Bulletin (October/November 2012) gives some of the actual dollar amounts required by Cornell University to sustain arXiv as a repository. NFAIS members may recall that Cornell University introduced a modified consortium/membership model for support of arXiv in 2010, but it has since also worked to attract additional matching funds to ensure the service’s continued health. For the case study, see http://www.asis.org/Bulletin/Oct-12/OctNov12_Rieger.html; for the most recent update on gifts and matching funds, see http://news.cornell.edu/stories/2012/08/arxiv-now-worldwide-consortium.
 
Because the interests of NFAIS members are primarily oriented towards support for professional researchers and scholars, one might think that the February memorandum would be of the greatest interest to this community. But there is a second wave that is also buffeting this membership.
 
The White House itself called it a Big Day for Open Data (http://www.whitehouse.gov/blog/2013/05/10/recap-big-day-open-data). A second executive order and memo were issued by the White House Office of Management and Budget in May of this year. Entitled Open Data Policy – Managing Information as an Asset, the so-called Burwell memo required agencies to:
(1) Collect or create information in a way that supports downstream information processing and dissemination activities
(2) Build information systems to support interoperability and information accessibility
(3) Strengthen data management and release practices
(4) Strengthen measures to ensure that privacy and confidentiality are fully protected and that data are properly secured
(5) Incorporate new interoperability and openness requirements into core agency processes
(http://www.whitehouse.gov/sites/default/files/omb/memoranda/2013/m-13-13.pdf)
 
Slate Magazine said “It sends a clear statement from the top that open and machine-readable should be the default for government information…With this executive order, the President and his advisors have focused on using open data for entrepreneurship, innovation, and scientific discovery.”  (http://www.slate.com/articles/technology/future_tense/2013/05/open_data_executive_order_is_the_best_thing_obama_s_done_this_month.html).
 
Forrester analyst Jennifer Belissent, Ph.D., also heralded the significance of this memorandum and noted the clear emphasis on engaging application developers rather than the average citizen (see: http://blogs.forrester.com/jennifer_belissent_phd/13-05-16-open_data_is_an_asset_new_us_federal_guidance_for_reaching_its_full_potential). That emphasis on building for the economic purposes of fueling data-driven innovation raises another issue. The analyst notes that this represents a marketing challenge for federal agencies; it becomes a question of “How do potential audiences find the data and how do agencies get the word out?  A first step is through data inventories.  The Memo stipulates that agencies create or update their data inventories. The goal is again to include all data – public, potentially public and not to be made public data – over time.  A public listing – of just those data sets already published or potentially publishable – would be made available on Data.gov. But this remains an if-we-build-it-they-will-come marketing strategy.  Promotion requires a strategy for outreach to potential audiences.  Agencies, rev your marketing engines!”  (The OMB has required agencies to have a full inventory of their available data-sets by early December.)
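The inventory split described above -- all data sets are inventoried with an access level, but only the public and potentially public ones surface in a Data.gov-style listing -- can be sketched briefly. The record structure and access-level labels below are invented for illustration and echo, rather than reproduce, the memo's public / potentially public / non-public framing.

```python
# Hypothetical sketch of the Burwell-memo inventory split: an agency
# inventories every data set with an access level, and only entries
# marked public (or potentially public) appear in the public listing.
# Record structure and label strings are invented for illustration.

INVENTORY = [
    {"name": "energy-usage-2012", "access": "public"},
    {"name": "grant-awards-2013", "access": "potentially public"},
    {"name": "personnel-records", "access": "non-public"},
]

def public_listing(inventory):
    """Return the names of data sets eligible for the public listing."""
    return [d["name"] for d in inventory
            if d["access"] in ("public", "potentially public")]

print(public_listing(INVENTORY))  # -> ['energy-usage-2012', 'grant-awards-2013']
```

The design point is that the full inventory and the public listing are distinct artifacts: the agency tracks everything internally, while the outward-facing catalog is a filtered view.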
 
The agenda behind this interest in open data is twofold – one aim is to make more evident the dollar value of government investment in various research initiatives, and the other is to make more evident the actual return on investment in order to effectively direct additional spending.  A number of emerging entities, such as New York University’s Governance Lab (http://www.thegovlab.org) and the Research Data Alliance (https://rd-alliance.org), housed at Rensselaer Polytechnic Institute, are indicative of the momentum behind this challenge.
 
In the context of these two memoranda, agencies are faced with a complex and unfunded mandate to open up access to vast amounts of data, some of which is captured through government activity -- thus falling into the public domain -- and some of which is gathered through federally-funded research projects. What will work most successfully for the highly diverse government agencies affected is not likely to be a simple, one-size-fits-all approach.
The various plans submitted to the OSTP for approval will have to recognize the strengths of both the CHORUS and the SHARE approach but factor those strengths into unique solutions that best serve the diverse research communities that are each agency’s particular care and constituency.
 
*************************************
 
2013 NFAIS Supporters
 
Access Innovations, Inc.
 
Accessible Archives, Inc.
 
American Psychological Association/PsycINFO
 
American Theological Library Association
 
Annual Reviews
 
CAS
 
CrossRef
 
Data Conversion Laboratory, Inc.
 
Defense Technical Information Center
 
Getty Research Institute
 
The H. W. Wilson Foundation
 
Information Today, Inc.
 
IFIS
 
Modern Language Association
 
OCLC
 
Philosopher’s Information Center
 
ProQuest
 
RSuite CMS
 
Scope e-Knowledge Center
 
TEMIS, Inc.
 
Thomson Reuters IP & Science
 
Thomson Reuters IP Solutions
 
Unlimited Priorities LLC
 
 
********************************
 
 
 
 
 
 
 
 
 
 