Reproducibility notes

This document records how the CAPES/SJR data used in the post are downloaded, filtered, treated, audited, and analyzed. It is intentionally explicit about file names because the main risk in this project is data provenance rather than model complexity.

Current status

The main analysis in the post is still the CAPES 7 analysis. CAPES 6 data have also been downloaded and audited so that they can be used later as a “what happens if we also include CAPES 6?” robustness section.

The implemented analysis window is 2017-2020. Older 2013-2016 CAPES files exist in some places in the project, but they are not part of the current processed analysis because the accessible article-detail datastore for that period is incomplete.

Main directories

config/datasets.json: official CAPES resource configuration and direct URLs.
data/raw/capes/programs/: official CAPES program metadata CSVs.
data/raw/capes_filtered/article_details/: program-filtered article-detail rows from the CAPES datastore API.
data/raw/capes_filtered/authors/: program-filtered author rows from the CAPES datastore API.
data/raw/capes_filtered/docentes/: program-filtered docente rows from the CAPES datastore API.
data/raw/scimago/: SJR/SCImago journal metric files.
data/manual/: derived program lists used to define the sample.
data/processed/: generated SQLite database and analysis CSVs.
reports/: audit outputs and narrative summaries.
figures/: figures generated by analysis.R.

Official CAPES inputs

The CAPES program metadata files are configured under the programs_2017_2020 dataset in config/datasets.json. The local files are:

data/raw/capes/programs/br-capes-colsucup-prog-2017-2021-11-10.csv
data/raw/capes/programs/br-capes-colsucup-prog-2018-2021-11-10.csv
data/raw/capes/programs/br-capes-colsucup-prog-2019-2021-11-10.csv
data/raw/capes/programs/br-capes-colsucup-prog-2020-2021-11-10.csv

These files contain, among other fields, AN_BASE, CD_PROGRAMA_IES, NM_PROGRAMA_IES, SG_ENTIDADE_ENSINO, NM_ENTIDADE_ENSINO, NM_AREA_AVALIACAO, and CD_CONCEITO_PROGRAMA.

Download them with:

python3 scripts/download_capes.py --dataset programs_2017_2020

The CAPES article-detail resource used in the current analysis is:

resource id: 6646b204-8db4-4e41-b59f-f24f87eed6e4
URL: configured in config/datasets.json
subtype: bibliographic production, journal articles, ARTPE

The full official download target is configured as:

data/raw/capes/article_details/br-colsucup-prod-detalhe-bibliografica-2017a2020-2022-06-30-artpe.csv

For this project, filtered files were fetched by CD_PROGRAMA_IES through the CAPES datastore API, producing:

data/raw/capes_filtered/article_details/capes7_article_details_2017_2020_artpe.csv
data/raw/capes_filtered/article_details/capes6_article_details_2017_2020_artpe.csv

The CAPES author resources are yearly resources from the 2017-2020 ARTPE author dataset. The filtered local files are:

data/raw/capes_filtered/authors/capes7_authors_2017_artpe.csv
data/raw/capes_filtered/authors/capes7_authors_2018_artpe.csv
data/raw/capes_filtered/authors/capes7_authors_2019_artpe.csv
data/raw/capes_filtered/authors/capes7_authors_2020_artpe.csv
data/raw/capes_filtered/authors/capes6_authors_2017_artpe.csv
data/raw/capes_filtered/authors/capes6_authors_2018_artpe.csv
data/raw/capes_filtered/authors/capes6_authors_2019_artpe.csv
data/raw/capes_filtered/authors/capes6_authors_2020_artpe.csv

The filtered docente datastore calls currently produce empty files:

data/raw/capes_filtered/docentes/capes7_docentes_2017.csv
data/raw/capes_filtered/docentes/capes7_docentes_2018.csv
data/raw/capes_filtered/docentes/capes7_docentes_2019.csv
data/raw/capes_filtered/docentes/capes7_docentes_2020.csv
data/raw/capes_filtered/docentes/capes6_docentes_2017.csv
data/raw/capes_filtered/docentes/capes6_docentes_2018.csv
data/raw/capes_filtered/docentes/capes6_docentes_2019.csv
data/raw/capes_filtered/docentes/capes6_docentes_2020.csv

This is not a local download failure: API audits returned zero official rows for these filtered docente resources. The project therefore uses observed faculty author rows where professor-level denominators are needed, and that limitation should be stated in any publication.

SJR input

The current journal metric file is:

data/raw/scimago/scimagojr_sjr_best_quartile_2024.csv

It has 31,136 data rows plus a header. This is a 2024 proxy for SJR indexing and quartile assignment. It should eventually be replaced with historical SJR files for 2017, 2018, 2019, and 2020 if those files can be obtained.

The analysis matches articles to SJR journals by ISSN. ISSNs are cleaned by removing non-alphanumeric characters and then keeping the first eight characters that are digits or X.
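
A minimal sketch of that cleaning rule, for illustration only. The function name clean_issn and the choice to return None when fewer than eight valid characters remain are assumptions, not necessarily what scripts/build_database.py does.

```python
import re


def clean_issn(raw):
    """Return an 8-character ISSN key, or None if one cannot be formed."""
    if raw is None:
        return None
    # Drop hyphens, spaces, and any other non-alphanumeric characters.
    alnum = re.sub(r"[^0-9A-Za-z]", "", str(raw)).upper()
    # Keep only characters that can appear in an ISSN: digits and X.
    valid = re.sub(r"[^0-9X]", "", alnum)
    key = valid[:8]
    # Assumption: values shorter than eight characters are treated as unmatched.
    return key if len(key) == 8 else None
```

With this sketch, "1234-567x" and "1234567X" map to the same key, so hyphenation and letter case do not affect the SJR match.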

Program selection

Program lists are derived from the official CAPES program CSVs with:

python3 scripts/derive_capes7_programs.py
python3 scripts/derive_capes7_programs.py --scores 6 --out data/manual/capes6_programs.csv

The resulting files are:

data/manual/capes7_programs.csv
data/manual/capes6_programs.csv

Both files have the same schema:

cd_programa_ies,nm_programa_ies,sg_entidade_ensino,nm_entidade_ensino,area_avaliacao,discipline_group,capes_score,focal_field

The current study-area sample contains 14 CAPES evaluation areas:

Administração Pública e de Empresas, Ciências Contábeis e Turismo
Astronomia / Física
Ciência da Computação
Ciência Política e Relações Internacionais
Ciências Biológicas I
Economia
Educação
Filosofia
História
Linguística e Literatura
Matemática / Probabilidade e Estatística
Psicologia
Química
Sociologia

The current counts are:

CAPES 7: 67 programs, including 6 in Linguística e Literatura.
CAPES 6: 81 programs, including 13 in Linguística e Literatura.

The focal-field flag focal_field is 1 when the normalized area/program text contains LINGUIST, LETRAS, or LITERATURA; otherwise it is 0.
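
The flag rule can be sketched as follows. This assumes that "normalized" means accent-stripped and upper-cased text; the helper names normalize_text and focal_field_flag are hypothetical and not taken from scripts/derive_capes7_programs.py.

```python
import unicodedata

FOCAL_TOKENS = ("LINGUIST", "LETRAS", "LITERATURA")


def normalize_text(text):
    # Assumption: normalization strips accents and upper-cases the text.
    decomposed = unicodedata.normalize("NFKD", text or "")
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_accents.upper()


def focal_field_flag(area, program):
    """Return 1 if the area or program name contains a focal token, else 0."""
    haystack = normalize_text(area) + " " + normalize_text(program)
    return int(any(token in haystack for token in FOCAL_TOKENS))
```

Because the test is a substring match on both the area and the program name, a program outside the Linguística e Literatura area would also be flagged if its own name contains one of the tokens.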

Filtered CAPES downloads

Filtered CAPES files are downloaded from the CAPES datastore API with:

python3 scripts/fetch_capes7_datastore.py \
  --programs data/manual/capes7_programs.csv \
  --label capes7 \
  --resources article_details_2017_2020 authors_2017 authors_2018 authors_2019 authors_2020 docentes_2017 docentes_2018 docentes_2019 docentes_2020

python3 scripts/fetch_capes7_datastore.py \
  --programs data/manual/capes6_programs.csv \
  --label capes6 \
  --resources article_details_2017_2020 authors_2017 authors_2018 authors_2019 authors_2020 docentes_2017 docentes_2018 docentes_2019 docentes_2020

The script pages through CAPES datastore_search using:

filters={"CD_PROGRAMA_IES": program_code}
limit=250
increasing offset
retry logic for timeouts and temporary network failures
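
The paging loop can be sketched roughly as below. It is not the code in scripts/fetch_capes7_datastore.py: fetch_page is a hypothetical callable that sends the query parameters to datastore_search and returns the decoded JSON, and the retry count and backoff are illustrative defaults. The sketch assumes the usual CKAN response shape, with rows under result.records.

```python
import json
import time


def fetch_program_rows(fetch_page, resource_id, program_code,
                       limit=250, max_retries=3, backoff_seconds=1.0):
    """Collect all datastore rows for one program by paging with offset."""
    rows = []
    offset = 0
    while True:
        params = {
            "resource_id": resource_id,
            "filters": json.dumps({"CD_PROGRAMA_IES": program_code}),
            "limit": limit,
            "offset": offset,
        }
        for attempt in range(max_retries):
            try:
                payload = fetch_page(params)
                break
            except (TimeoutError, ConnectionError):
                # Re-raise once the retry budget is exhausted.
                if attempt == max_retries - 1:
                    raise
                time.sleep(backoff_seconds * (attempt + 1))
        records = payload["result"]["records"]
        rows.extend(records)
        # A short (or empty) page means the last page has been reached.
        if len(records) < limit:
            return rows
        offset += limit
```

Passing the HTTP call in as a parameter keeps the paging logic testable without touching the slow CAPES API.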

Download audits

Local audits are generated with:

python3 scripts/audit_capes_downloads.py \
  --label capes7 \
  --programs data/manual/capes7_programs.csv \
  --out reports/audit_capes7_local.csv

python3 scripts/audit_capes_downloads.py \
  --label capes6 \
  --programs data/manual/capes6_programs.csv \
  --out reports/audit_capes6_local.csv

API-total audits can also be run, but they are slow because they issue one count query per program and resource:

python3 scripts/audit_capes_downloads.py \
  --label capes7 \
  --programs data/manual/capes7_programs.csv \
  --api \
  --out reports/audit_capes7_api.csv

python3 scripts/audit_capes_downloads.py \
  --label capes6 \
  --programs data/manual/capes6_programs.csv \
  --api \
  --out reports/audit_capes6_api.csv

Current audit files:

reports/audit_capes7_local.csv
reports/audit_capes6_local.csv
reports/audit_capes7_api.csv
reports/audit_capes6_api.csv

Important audit results:

CAPES 7 article details: 42,235 local rows, 67 selected programs, no missing or extra programs.
CAPES 6 article details: 35,972 local rows, 81 selected programs, no missing or extra programs.
CAPES 7 author files now match the API totals used in the audit: 3,172 rows for 2017, 4,351 for 2018, 13,367 for 2019, and 48,175 for 2020.
CAPES 6 author files match API totals: 1,217 rows for 2017, 2,973 for 2018, 2,622 for 2019, and 35,665 for 2020.
All filtered docente files have zero rows, and API audits also returned zero rows for those filtered resources.

One data-management issue was found and repaired during this audit: the older capes7_* filtered raw files did not cover all 67 current CAPES 7 programs. They were re-fetched from the official CAPES API. The repaired capes7_article_details_2017_2020_artpe.csv now matches the API total exactly.
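
The kind of coverage check behind these results can be sketched as follows. This is an illustration under assumptions, not the logic of scripts/audit_capes_downloads.py: the function name, the report keys, and the api_totals mapping (program code to API row count) are all hypothetical.

```python
from collections import Counter


def audit_local_rows(rows, selected_programs, api_totals=None):
    """Summarize local row coverage against the selected program list."""
    counts = Counter(row["CD_PROGRAMA_IES"] for row in rows)
    selected = set(selected_programs)
    report = {
        "local_rows": sum(counts.values()),
        # Selected programs with no local rows at all.
        "missing_programs": sorted(selected - set(counts)),
        # Programs present locally but not in the selected list.
        "extra_programs": sorted(set(counts) - selected),
    }
    if api_totals is not None:
        # Programs whose local row count differs from the API count.
        report["mismatched_programs"] = sorted(
            code for code in selected
            if counts.get(code, 0) != api_totals.get(code, 0)
        )
    return report
```

A program can appear under missing_programs and still match the API if the API itself reports zero rows for it, which is the situation described for the 2017-2019 author resources.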

Data treatment

The CAPES 7 processed tables are built with:

python3 scripts/build_database.py
python3 scripts/export_summaries.py
Rscript analysis.R

scripts/build_database.py currently builds the main CAPES 7 analysis. It imports raw CAPES files from data/raw/capes_filtered/, imports data/manual/capes7_programs.csv, imports SJR files from data/raw/scimago/, and writes:

data/processed/study.sqlite
data/processed/articles.csv
data/processed/professor_year.csv
data/processed/program_year.csv
data/processed/discipline_summary.csv
data/processed/match_audit.csv

The main normalized article table is data/processed/articles.csv. Its key columns include:

cd_programa_ies
nm_programa_ies
sg_entidade_ensino
nm_entidade_ensino
year
production_id
title
language_raw
doi
issn_raw
journal_vehicle_id
issn_clean
is_english
area_avaliacao
discipline_group
focal_field
sjr_quartile
sjr_score
journal_title_sjr
journal_country_sjr
journal_publisher_sjr
is_q1
is_q1_q2

Main transformations:

Keep only article rows whose year is between 2017 and 2020.
De-duplicate article rows at the program-year-production level: cd_programa_ies, year, production_id.
Use DS_IDIOMA to create is_english.
Clean DS_ISSN into issn_clean.
Match issn_clean to SJR ISSNs.
Keep non-indexed articles in the denominator; absence from SJR is treated as substantively informative, not random missingness.
Create is_q1 and is_q1_q2 from SJR quartiles.
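
The matching and flag steps can be sketched as below, mainly to show what keeping non-indexed articles in the denominator means in practice. The lookup sjr_by_issn (cleaned ISSN to a dict with a "quartile" key) and both function names are assumptions for illustration.

```python
def add_sjr_flags(article, sjr_by_issn):
    """Return a copy of the article row with SJR match columns attached."""
    match = sjr_by_issn.get(article.get("issn_clean"))
    quartile = match["quartile"] if match else None
    out = dict(article)
    out["sjr_quartile"] = quartile
    # Unmatched articles get 0, not a missing value, so they stay in the denominator.
    out["is_q1"] = int(quartile == "Q1")
    out["is_q1_q2"] = int(quartile in ("Q1", "Q2"))
    return out


def share(rows, flag):
    """Mean of a 0/1 flag over the given rows."""
    return sum(r[flag] for r in rows) / len(rows) if rows else float("nan")
```

Calling share on all flagged rows gives the rate with non-indexed articles in the denominator, while calling it on rows whose sjr_quartile is not None gives the rate among indexed articles only.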

The observation unit for field summaries is a program-article record. If the same intellectual production is reported by two programs, it is counted once for each reporting program because the analysis compares program/area publication profiles.
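
A minimal sketch of de-duplication at the program-year-production level, using the column names from data/processed/articles.csv. The function name is hypothetical, and the sketch assumes the first occurrence of each key is the one to keep.

```python
def dedupe_program_articles(rows):
    """Keep the first row for each (cd_programa_ies, year, production_id)."""
    seen = set()
    kept = []
    for row in rows:
        key = (row["cd_programa_ies"], row["year"], row["production_id"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept
```

Because the program code is part of the key, a production reported by two programs survives as two rows, one per reporting program, while an exact repeat within one program is dropped.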

Analysis outputs

analysis.R reads:

data/processed/articles.csv
data/processed/discipline_summary.csv
data/processed/program_year.csv
data/processed/match_audit.csv

It writes:

data/processed/analysis_discipline_quality.csv
data/processed/analysis_quartile_distribution.csv
data/processed/analysis_program_quality.csv
data/processed/analysis_focal_contrasts.csv
data/processed/analysis_language_by_indexing.csv
data/processed/analysis_english_mechanism.csv
data/processed/analysis_discipline_language_mechanism.csv
data/processed/analysis_sjr_country_summary.csv
data/processed/analysis_logit_models.csv

It also writes figures to figures/, including:

figures/pct_english_by_discipline.png
figures/sjr_indexing_rate_by_discipline.png
figures/sjr_quartile_distribution_by_discipline.png
figures/q1q2_among_indexed_by_discipline.png
figures/english_vs_q1q2_by_discipline.png
figures/english_share_over_time.png
figures/sjr_indexing_by_language_and_field.png
figures/q1q2_by_language_and_field.png

The Quarto post ../index.qmd reads lattes/data/processed/articles.csv and lattes/data/manual/capes7_programs.csv directly, then computes display tables and figures inside the document.

CAPES 6 afterthought analysis

The CAPES 6 raw files are downloaded and audited, but the current processed tables are still CAPES 7 only. To add CAPES 6 as an afterthought section, the next clean step is to build a combined, score-coded analysis table rather than silently mixing CAPES 6 rows into the existing CAPES 7 table.

Recommended combined schema:

Add capes_score to every article row.
Add score_group, e.g. CAPES 6, CAPES 7.
Preserve focal_field.
Preserve discipline_group.
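
One way the combined table could be assembled is sketched below. Nothing like this exists in the project yet; the function name and the input shape (a dict mapping each integer CAPES score to its normalized article rows) are assumptions.

```python
def build_combined_articles(articles_by_score):
    """Stack per-score article rows into one score-coded table."""
    combined = []
    for score, rows in sorted(articles_by_score.items()):
        for row in rows:
            # Copying the row preserves focal_field and discipline_group as-is.
            out = dict(row)
            out["capes_score"] = score
            out["score_group"] = f"CAPES {score}"
            combined.append(out)
    return combined
```

Keeping score_group on every row makes it straightforward to filter back to the CAPES 7 rows for the main figures and to show CAPES 6 separately.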

Recommended comparisons:

Keep all current CAPES 7-only figures as the main analysis.
Add a section titled something like “E se incluirmos programas CAPES 6?”
In that section, show CAPES 6 and CAPES 7 separately or side by side.
Avoid replacing the central CAPES 7 claim with a pooled CAPES 6/7 result.

Known limitations

SJR quartiles currently use data/raw/scimago/scimagojr_sjr_best_quartile_2024.csv as a proxy. Historical 2017-2020 SJR files would be better.
The filtered docente CAPES datastore resources return zero rows for the selected programs. This affects professor-denominator analyses.
Author resources for 2017, 2018, and 2019 do not cover every selected program in the CAPES API itself. This is not a local download issue; API audits show the local files match official datastore totals. The 2020 author resource covers all selected CAPES 6 and CAPES 7 programs.
The program list treats a program's CAPES score as the selected score (6 or 7) observed in the 2017-2020 program metadata files. For a publication, decide whether the score should be fixed at one reference year or treated year by year.

Suggested simplification for a future repository

For a public replication repository, simplify the workflow. A reader should not need to run many scripts or wait on slow CAPES API calls.

Recommended structure:

data/
  raw/
    capes_programs_2017_2020.csv
    capes_article_details_filtered_2017_2020.csv
    capes_authors_filtered_2017_2020.csv
    scimago_sjr_2024.csv
  processed/
    analysis_input.RData
    analysis_input.rds
scripts/
  reproduce_analysis.R
README.md

Recommended analysis_input.RData objects:

programs: score-coded CAPES 6/7 program list.
articles: normalized article table with SJR match columns.
discipline_summary: summary by discipline and score.
program_year: summary by program/year/score.
audit_reports: compact versions of the local/API audit tables.

Recommended single script:

scripts/reproduce_analysis.R

That script should:

Load analysis_input.RData.
Recompute all tables used in the article.
Recompute all figures.
Optionally rebuild analysis_input.RData from raw CSVs if REBUILD=TRUE.

The slow CAPES API download/audit scripts should be preserved for provenance, but they should not be required for ordinary reproduction of the published analysis.