Retrieving Publications Data from Scholar.

From LiquidPubWiki


To retrieve publication data from Scholar, we need to parse the result pages.

Step One: get the number of results

For this we parse the page using the tag (table), then extract the number of results from the first node.
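As a rough illustration of this step, the sketch below collects the text of the first (table) node and pulls a number out of it. The class name FirstTableText, the function result_count, and the assumed wording of the node ("... of about 1,230 ...") are hypothetical; the actual text on the page may differ, so the regular expression would need adjusting.

```python
import re
from html.parser import HTMLParser


class FirstTableText(HTMLParser):
    """Collect the text found inside the first <table> element only."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the first table
        self.done = False   # True once the first table has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and not self.done:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def result_count(html):
    """Return the number of results, or None if no number is found."""
    parser = FirstTableText()
    parser.feed(html)
    text = " ".join(parser.chunks)
    # Assumption: the count follows the word "of", optionally "of about".
    match = re.search(r"of\s+(?:about\s+)?(\d[\d,.]*)", text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return int(re.sub(r"\D", "", match.group(1)))
```

Only the first table is read, so numbers that appear in later tables on the page do not affect the result.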

The number of results is useful because it tells us how many pages we need to read. We get 100 results per page, because that is the maximum number of publications that Scholar lets us retrieve per page.
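The page count follows from a ceiling division of the number of results by 100. The sketch below shows that arithmetic; the function name page_offsets and the idea of addressing each page by the index of its first result are assumptions made for illustration, not something the page states.

```python
import math

RESULTS_PER_PAGE = 100  # maximum per page, as stated above


def page_offsets(total_results, per_page=RESULTS_PER_PAGE):
    """Return the index of the first result on each page to read."""
    pages = math.ceil(total_results / per_page)
    return [page * per_page for page in range(pages)]
```

For example, 1,230 results need 13 pages: twelve full pages of 100 and a last page with the remaining 30.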

Step Two: start the parsing for each publication

The information that we get for each publication is:

- Title and URL: for this we look for the tag (h3 class="r"). Inside it we look for the text "a href" to get the URL; the title follows it.

- Author list, publication, year and publisher: for this we look for the tag (span class="a"). Then we look for the first occurrence of the character "-". The list of authors comes before this character; the publication, year and publisher come after it, in that order.

- Citation count: for this we look for the tag (span class=fl). Then we look for the text "Cited by" to get the citation count, and the text "a href" to get the URL of the citing documents.
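The three lookups above could be sketched as follows, assuming the contents of each tag have already been isolated as a string. The function names and the sample markup in the usage note are hypothetical, and splitting the author list on commas is an assumption. Note that splitting at the first "-" will cut an author name that itself contains a hyphen.

```python
import re
from html import unescape

TAG = re.compile(r"<[^>]+>")


def strip_tags(fragment):
    """Remove markup and decode entities, leaving plain text."""
    return unescape(TAG.sub("", fragment)).strip()


def parse_title(h3_fragment):
    """Return (url, title) from the contents of the h3 class="r" tag."""
    match = re.search(r'<a\s+href="([^"]*)"[^>]*>(.*?)</a>', h3_fragment, re.S)
    if match is None:
        # No link found: keep the text as the title, with no URL.
        return None, strip_tags(h3_fragment)
    return unescape(match.group(1)), strip_tags(match.group(2))


def parse_byline(span_text):
    """Split the span class="a" text at the first "-" character."""
    authors, _, rest = span_text.partition("-")
    # Assumption: authors are separated by commas.
    author_list = [a.strip() for a in authors.split(",") if a.strip()]
    # The remainder holds publication, year and publisher, in that order.
    return author_list, rest.strip()


def parse_citations(span_fragment):
    """Return (citing_url, count) from the contents of the span class=fl tag."""
    match = re.search(r'<a\s+href="([^"]*)"[^>]*>\s*Cited by\s+(\d+)',
                      span_fragment)
    if match is None:
        return None, 0
    return unescape(match.group(1)), int(match.group(2))
```

With made-up input, parse_byline("A Author, B Writer - Some Journal, 2008 - example.org") gives the authors ["A Author", "B Writer"] and leaves "Some Journal, 2008 - example.org" as the remainder to be split further.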
