Getting list of all CPAN package names is broken #1961

Open
andrew opened this issue Feb 7, 2018 · 2 comments · May be fixed by #2817

Comments

@andrew
Contributor

andrew commented Feb 7, 2018

Paging through the CPAN releases API no longer works past the first 10,000 results: any request where from + size exceeds 10,000 is rejected.

Code location: https://github.com/librariesio/libraries.io/blob/master/app/models/package_manager/cpan.rb#L17

Example url:

https://fastapi.metacpan.org/v1/release/_search?fields=distribution&from=10000&q=status%3Alatest&size=5000&sort=date%3Adesc

Error:

{
"message": "[Request] ** [http://127.0.0.1:9200]-[500] {\"error\":{\"root_cause\":[{\"type\":\"query_phase_execution_exception\",\"reason\":\"Result window is too large, from + size must be less than or equal to: [10000] but was [15000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.\"}],\"type\":\"search_phase_execution_exception\",\"reason\":\"all shards failed\",\"phase\":\"query\",\"grouped\":true,\"failed_shards\":[{\"shard\":0,\"index\":\"cpan_v1_01\",\"node\":\"euEoqisPSk68CnedNAzoZA\",\"reason\":{\"type\":\"query_phase_execution_exception\",\"reason\":\"Result window is too large, from + size must be less than or equal to: [10000] but was [15000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.\"}}]},\"status\":500}, called from sub Search::Elasticsearch::Role::Client::Direct::__ANON__ at /home/metacpan/metacpan-api/lib/MetaCPAN/Server/Controller.pm line 125. With vars: {'request' => {'method' => 'GET','ignore' => [],'path' => '/cpan/release/_search','serialize' => 'std','qs' => {'q' => 'status:latest','fields' => 'distribution','sort' => 'date:desc','size' => 5000,'from' => 10000},'body' => undef},'status_code' => 500}\n"
}
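
To make the limit in that error concrete: the request is rejected whenever from + size exceeds index.max_result_window, which the message reports as 10000. A minimal sketch of that check follows. The function name within_result_window is hypothetical, and the default of 10000 is copied from the error text, not read from MetaCPAN's configuration.

```shell
#!/bin/bash

# Hypothetical helper: succeeds (exit 0) when a from/size pair fits inside
# the result window, fails (exit 1) otherwise.
# The 10000 default is an assumption taken from the error message above.
within_result_window() {
	local from="$1" size="$2" max_window="${3:-10000}"
	(( from + size <= max_window ))
}

# from=5000,  size=5000 -> 10000, still accepted
# from=10000, size=5000 -> 15000, the failing request from this issue
for from in 0 5000 10000; do
	if within_result_window "$from" 5000; then
		echo "from=$from size=5000: ok"
	else
		echo "from=$from size=5000: result window too large"
	fi
done
```

With a page size of 5000, only the first two pages are reachable through from/size paging, which is why the third request fails.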

The docs suggest using the scroll api: https://github.com/metacpan/metacpan-api/blob/master/docs/API-docs.md#being-polite but the links to the docs are dead.

More recent scroll API docs are here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html but I couldn't get the search request to accept scroll_id as a parameter:

{
"message": "[Param] ** Unknown param (scroll_id) in (search) request. See docs at: http://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html, called from sub Search::Elasticsearch::Role::Client::Direct::__ANON__ at /home/metacpan/metacpan-api/lib/MetaCPAN/Server/Controller.pm line 125."
}
andrew added a commit that referenced this issue Feb 7, 2018
@zmughal

zmughal commented May 18, 2021

Perhaps this will help. I updated the links for the metacpan-api documentation.

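#!/bin/bash
# List all CPAN distribution names via the MetaCPAN scroll API.
# Requires curl and jq.

COUNT="5000"
OUTPUT_FILE="/tmp/metacpan-dists.jsonl"
# jq filter: print each distribution name, whether the field is a string or an array.
FILTER='.hits.hits[].fields.distribution | if type == "array" then .[] else . end'

echo "Request: 1"
# The first request opens a scroll context that is kept alive for 1 minute.
JSON0="$(curl -s "https://fastapi.metacpan.org/v1/release/_search?scroll=1m&size=$COUNT&q=status:latest&fields=distribution")"

TOTAL=$(echo "$JSON0" | jq '.hits.total')
echo "Total dists: $TOTAL"
# Ceiling division: round a partial last page up to a full request.
REQUESTS_N=$(( (TOTAL + COUNT - 1) / COUNT ))
echo "Will make $REQUESTS_N requests total"

SCROLL_ID=$(echo "$JSON0" | jq -r '._scroll_id')

echo "$JSON0" | jq "$FILTER" > "$OUTPUT_FILE"

for i in $( seq 2 "$REQUESTS_N" ); do
	echo "Request: $i"
	# Later pages come from the scroll endpoint, with the scroll ID as the request body.
	JSON="$(curl -s -XPOST 'https://fastapi.metacpan.org/v1/_search/scroll?scroll=1m' -d "$SCROLL_ID")"
	SCROLL_ID=$(echo "$JSON" | jq -r '._scroll_id')
	echo "$JSON" | jq "$FILTER" >> "$OUTPUT_FILE"
done

# Count the unique distribution names collected.
sort -u "$OUTPUT_FILE" | wc -l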
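
The request count in the script is a ceiling division: adding COUNT - 1 before the integer divide rounds a partial final page up to a whole request. A small sketch of just that arithmetic follows. The helper name ceil_div is hypothetical, and the totals are made-up illustrations, not real MetaCPAN counts.

```shell
#!/bin/bash

# Hypothetical helper: number of pages of size $2 needed to cover $1 items.
ceil_div() {
	local total="$1" per_page="$2"
	echo $(( (total + per_page - 1) / per_page ))
}

ceil_div 10000 5000   # exact multiple of the page size
ceil_div 10001 5000   # one extra item forces an additional request
ceil_div 0 5000       # nothing to fetch
```

Because the total can change while the scroll is in progress, stopping when a scroll response comes back with no hits may be a more defensive condition than a precomputed count.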

zmughal added a commit to zmughal/libraries.io that referenced this issue May 18, 2021
This reverts commit 9c6d0f9.

Connects with <librariesio#1961>.
zmughal added a commit to zmughal/libraries.io that referenced this issue May 30, 2021
@M00NZ1R94

Definition of the money is the same as the best possible option for a good friend
