Retrieving Publications Data from Scholar.
From LiquidPubWiki
To retrieve the publication data, we need to parse the HTML of the Scholar results pages.
Step One: get the number of results
For this we locate the tag (table) and extract the number of results from its first node.
The number of results is useful because it tells us how many pages we need to read. We request 100 results per page, because that is the maximum number of publications that Scholar lets us retrieve per page.
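As a minimal sketch of this calculation, the snippet below derives the number of pages and the starting offset of each page from the total number of results. The names page_count and page_offsets are hypothetical helpers introduced here for illustration; only the limit of 100 results per page comes from the text above.

```python
RESULTS_PER_PAGE = 100  # per-page maximum stated in the text above


def page_count(total_results, per_page=RESULTS_PER_PAGE):
    """Return how many pages are needed to cover total_results."""
    # Integer ceiling division: a partially filled last page still counts.
    return (total_results + per_page - 1) // per_page


def page_offsets(total_results, per_page=RESULTS_PER_PAGE):
    """Return the index of the first result on each page (hypothetical helper)."""
    return list(range(0, total_results, per_page))
```

For example, 250 results would need three pages, starting at results 0, 100 and 200.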
Step Two: start the parsing for each publication
The information that we get for each publication is:
- Title and URL: for this we look for the tag (h3 class="r"). Inside it we look for the text "a href" to get the URL; the title follows the URL.
- Author list, publication, year and publisher: for this we look for the tag (span class="a"). Then we look for the first appearance of the char "-". Before this char is the list of authors, and after it are the publication, year and publisher, in that order.
- Citation count: for this we look for the tag (span class=fl). Then we look for the text "Cited by" to get the citation count and the text "a href" to get the URL of the citing documents.
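The three lookups above can be sketched roughly as follows. This is an illustration only, not the parser actually used: the SAMPLE fragment is made-up markup that merely follows the tags described in the list, parse_publication and strip_tags are hypothetical names, and regular expressions stand in for whatever parsing method the real code uses. The part after the first "-" is kept as a single string, since the exact layout of publication, year and publisher is not specified here.

```python
import re
from html import unescape

# Hypothetical fragment shaped like the markup described above (not real Scholar output).
SAMPLE = (
    '<h3 class="r"><a href="http://example.org/paper.pdf">An Example Title</a></h3>'
    '<span class="a">A Author, B Writer - Journal of Examples, 2008 - example.org</span>'
    '<span class=fl><a href="/scholar?cites=123">Cited by 42</a></span>'
)


def strip_tags(fragment):
    """Remove any markup and decode HTML entities."""
    return unescape(re.sub(r"<[^>]+>", "", fragment)).strip()


def parse_publication(html):
    """Extract the fields for one publication entry into a dict."""
    record = {}

    # Title and URL: the link inside (h3 class="r").
    block = re.search(r'<h3 class="r">(.*?)</h3>', html, re.S)
    if block:
        link = re.search(r'<a href="([^"]*)"[^>]*>(.*?)</a>', block.group(1), re.S)
        if link:
            record["url"] = unescape(link.group(1))
            record["title"] = strip_tags(link.group(2))

    # Authors and publication details: split (span class="a") at the first "-".
    block = re.search(r'<span class="?a"?>(.*?)</span>', html, re.S)
    if block:
        authors, _, rest = strip_tags(block.group(1)).partition("-")
        record["authors"] = [a.strip() for a in authors.split(",") if a.strip()]
        record["publication_info"] = rest.strip()

    # Citation count: the "Cited by" link inside (span class=fl), quoted or not.
    block = re.search(r'<span class="?fl"?>(.*?)</span>', html, re.S)
    if block:
        cited = re.search(r'<a href="([^"]*)"[^>]*>Cited by (\d+)</a>', block.group(1))
        if cited:
            record["cited_by_url"] = unescape(cited.group(1))
            record["cited_by"] = int(cited.group(2))

    return record


record = parse_publication(SAMPLE)
```

Splitting at the first "-" follows the rule above literally, so an author name that itself contains a hyphen would be cut short; a real parser would need to handle that case.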
