Reproducibility notes
This document records how the CAPES/SJR data used in the post are downloaded, filtered, treated, audited, and analyzed. It is intentionally explicit about file names because the main risk in this project is data provenance rather than model complexity.
Current status
The main analysis in the post is still the CAPES 7 analysis. CAPES 6 data have also been downloaded and audited so that they can be used later as a “what happens if we also include CAPES 6?” robustness section.
The implemented analysis window is 2017-2020. Older 2013-2016 CAPES files exist in some places in the project, but they are not part of the current processed analysis because the accessible article-detail datastore for that period is incomplete.
Main directories
- config/datasets.json: official CAPES resource configuration and direct URLs.
- data/raw/capes/programs/: official CAPES program metadata CSVs.
- data/raw/capes_filtered/article_details/: program-filtered article-detail rows from the CAPES datastore API.
- data/raw/capes_filtered/authors/: program-filtered author rows from the CAPES datastore API.
- data/raw/capes_filtered/docentes/: program-filtered docente rows from the CAPES datastore API.
- data/raw/scimago/: SJR/SCImago journal metric files.
- data/manual/: derived program lists used to define the sample.
- data/processed/: generated SQLite database and analysis CSVs.
- reports/: audit outputs and narrative summaries.
- figures/: figures generated by analysis.R.
Official CAPES inputs
The CAPES program metadata files are configured under the programs_2017_2020 dataset in config/datasets.json. The local files are:
- data/raw/capes/programs/br-capes-colsucup-prog-2017-2021-11-10.csv
- data/raw/capes/programs/br-capes-colsucup-prog-2018-2021-11-10.csv
- data/raw/capes/programs/br-capes-colsucup-prog-2019-2021-11-10.csv
- data/raw/capes/programs/br-capes-colsucup-prog-2020-2021-11-10.csv
These files contain, among other fields, AN_BASE, CD_PROGRAMA_IES, NM_PROGRAMA_IES, SG_ENTIDADE_ENSINO, NM_ENTIDADE_ENSINO, NM_AREA_AVALIACAO, and CD_CONCEITO_PROGRAMA.
Download them with:
python3 scripts/download_capes.py --dataset programs_2017_2020
The CAPES article-detail resource used in the current analysis is:
- resource id: 6646b204-8db4-4e41-b59f-f24f87eed6e4
- configured URL in config/datasets.json
- subtype: bibliographic production, journal articles, ARTPE
The full official download target is configured as:
data/raw/capes/article_details/br-colsucup-prod-detalhe-bibliografica-2017a2020-2022-06-30-artpe.csv
For this project, filtered files were fetched by CD_PROGRAMA_IES through the CAPES datastore API, producing:
- data/raw/capes_filtered/article_details/capes7_article_details_2017_2020_artpe.csv
- data/raw/capes_filtered/article_details/capes6_article_details_2017_2020_artpe.csv
The CAPES author resources are yearly resources from the 2017-2020 ARTPE author dataset:
- data/raw/capes_filtered/authors/capes7_authors_2017_artpe.csv
- data/raw/capes_filtered/authors/capes7_authors_2018_artpe.csv
- data/raw/capes_filtered/authors/capes7_authors_2019_artpe.csv
- data/raw/capes_filtered/authors/capes7_authors_2020_artpe.csv
- data/raw/capes_filtered/authors/capes6_authors_2017_artpe.csv
- data/raw/capes_filtered/authors/capes6_authors_2018_artpe.csv
- data/raw/capes_filtered/authors/capes6_authors_2019_artpe.csv
- data/raw/capes_filtered/authors/capes6_authors_2020_artpe.csv
The filtered docente datastore calls currently produce empty files:
- data/raw/capes_filtered/docentes/capes7_docentes_2017.csv
- data/raw/capes_filtered/docentes/capes7_docentes_2018.csv
- data/raw/capes_filtered/docentes/capes7_docentes_2019.csv
- data/raw/capes_filtered/docentes/capes7_docentes_2020.csv
- data/raw/capes_filtered/docentes/capes6_docentes_2017.csv
- data/raw/capes_filtered/docentes/capes6_docentes_2018.csv
- data/raw/capes_filtered/docentes/capes6_docentes_2019.csv
- data/raw/capes_filtered/docentes/capes6_docentes_2020.csv
This is not a local download failure: API audits returned zero official rows for these filtered docente resources. The project therefore uses observed faculty author rows where professor-level denominators are needed, and that limitation should be stated in any publication.
SJR input
The current journal metric file is:
data/raw/scimago/scimagojr_sjr_best_quartile_2024.csv
It has 31,136 data rows plus a header. This is a 2024 proxy for SJR indexing and quartile assignment. It should eventually be replaced with historical SJR files for 2017, 2018, 2019, and 2020 if those files can be obtained.
The analysis uses ISSN matching. ISSNs are cleaned by removing non-alphanumeric characters and keeping the first eight 0-9X characters.
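The cleaning rule can be sketched as follows. This is a minimal illustration and not the project's actual code: the function name clean_issn, the uppercasing step, and returning None when fewer than eight usable characters remain are all assumptions.

```python
import re


def clean_issn(raw):
    """Return an 8-character ISSN key, or None when the value is unusable.

    Hypothetical helper: uppercasing and the None fallback are assumptions.
    """
    if raw is None:
        return None
    # Remove everything that is not a letter or digit (hyphens, spaces, punctuation).
    alnum = re.sub(r"[^0-9A-Za-z]", "", str(raw)).upper()
    # Keep only characters that can occur in an ISSN: digits and the X check character.
    issn_chars = re.sub(r"[^0-9X]", "", alnum)
    # Keep the first eight such characters.
    return issn_chars[:8] if len(issn_chars) >= 8 else None


print(clean_issn("0102-4450"))
print(clean_issn("1234-567x"))
```

Applying the same function to both the CAPES ISSN field and the SJR ISSN field keeps the match keys comparable.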
Program selection
Program lists are derived from the official CAPES program CSVs with:
python3 scripts/derive_capes7_programs.py
python3 scripts/derive_capes7_programs.py --scores 6 --out data/manual/capes6_programs.csv
The resulting files are:
- data/manual/capes7_programs.csv
- data/manual/capes6_programs.csv
Both files have the same schema:
cd_programa_ies,nm_programa_ies,sg_entidade_ensino,nm_entidade_ensino,area_avaliacao,discipline_group,capes_score,focal_field
The current study-area sample contains 14 CAPES evaluation areas:
- Administração Pública e de Empresas, Ciências Contábeis e Turismo
- Astronomia / Física
- Ciência da Computação
- Ciência Política e Relações Internacionais
- Ciências Biológicas I
- Economia
- Educação
- Filosofia
- História
- Linguística e Literatura
- Matemática / Probabilidade e Estatística
- Psicologia
- Química
- Sociologia
The current counts are:
- CAPES 7: 67 programs, including 6 in Linguística e Literatura.
- CAPES 6: 81 programs, including 13 in Linguística e Literatura.
The focal-field flag focal_field is 1 when the normalized area/program text contains LINGUIST, LETRAS, or LITERATURA; otherwise it is 0.
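A minimal sketch of that flag, assuming "normalized" means accent-stripped and uppercased (the exact normalization is not documented here). The names normalize and focal_field as functions are hypothetical.

```python
import unicodedata

# Substrings named in the text above.
FOCAL_TOKENS = ("LINGUIST", "LETRAS", "LITERATURA")


def normalize(text):
    """Strip accents and uppercase. This normalization is an assumption."""
    decomposed = unicodedata.normalize("NFKD", text or "")
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_accents.upper()


def focal_field(area, program):
    """Return 1 when the area or program name contains a focal token, else 0."""
    combined = normalize(area) + " " + normalize(program)
    return int(any(token in combined for token in FOCAL_TOKENS))


print(focal_field("Linguística e Literatura", "Letras"))
print(focal_field("Economia", "Economia"))
```

Because the check runs on both the area and the program name, a program with a linguistics-related name in a different evaluation area would also be flagged under this sketch.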
Filtered CAPES downloads
Filtered CAPES files are downloaded from the CAPES datastore API with:
python3 scripts/fetch_capes7_datastore.py \
--programs data/manual/capes7_programs.csv \
--label capes7 \
--resources article_details_2017_2020 authors_2017 authors_2018 authors_2019 authors_2020 docentes_2017 docentes_2018 docentes_2019 docentes_2020
python3 scripts/fetch_capes7_datastore.py \
--programs data/manual/capes6_programs.csv \
--label capes6 \
--resources article_details_2017_2020 authors_2017 authors_2018 authors_2019 authors_2020 docentes_2017 docentes_2018 docentes_2019 docentes_2020
The script pages through CAPES datastore_search using:
- filters={"CD_PROGRAMA_IES": program_code}
- limit=250
- increasing offset
- retry logic for timeouts and temporary network failures
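The paging pattern can be sketched roughly as below. This is not the code in scripts/fetch_capes7_datastore.py: the function fetch_all, the injected fetch_page callable, the retry count, and the stop-on-short-page rule are assumptions made for illustration. A fake fetch function stands in for the network call so the sketch runs offline.

```python
import json
import time


def fetch_all(fetch_page, program_code, limit=250, max_retries=3, wait_seconds=0.0):
    """Collect all records for one program by increasing offset page by page.

    fetch_page is a hypothetical callable that takes the query parameters
    and returns the list of records for that page.
    """
    records = []
    offset = 0
    while True:
        params = {
            "filters": json.dumps({"CD_PROGRAMA_IES": program_code}),
            "limit": limit,
            "offset": offset,
        }
        for attempt in range(1, max_retries + 1):
            try:
                page = fetch_page(params)
                break
            except (TimeoutError, ConnectionError):
                # Give up only after the last allowed attempt.
                if attempt == max_retries:
                    raise
                time.sleep(wait_seconds * attempt)
        records.extend(page)
        # A page shorter than the limit means there is nothing left to fetch.
        if len(page) < limit:
            return records
        offset += limit


def make_fake_fetch(total):
    """Simulate a datastore with `total` rows whose first call times out."""
    state = {"failed_once": False}

    def fetch_page(params):
        if not state["failed_once"]:
            state["failed_once"] = True
            raise TimeoutError("simulated timeout")
        start = params["offset"]
        stop = min(start + params["limit"], total)
        return [{"row": i} for i in range(start, stop)]

    return fetch_page


rows = fetch_all(make_fake_fetch(total=600), "HYPOTHETICAL_CODE")
print(len(rows))  # 600
```

Injecting the fetch function keeps the paging and retry logic testable without calling the CAPES API.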
Download audits
Local audits are generated with:
python3 scripts/audit_capes_downloads.py \
--label capes7 \
--programs data/manual/capes7_programs.csv \
--out reports/audit_capes7_local.csv
python3 scripts/audit_capes_downloads.py \
--label capes6 \
--programs data/manual/capes6_programs.csv \
--out reports/audit_capes6_local.csv
API-total audits can also be run, but they are slow because they issue one count query per program and resource:
python3 scripts/audit_capes_downloads.py \
--label capes7 \
--programs data/manual/capes7_programs.csv \
--api \
--out reports/audit_capes7_api.csv
python3 scripts/audit_capes_downloads.py \
--label capes6 \
--programs data/manual/capes6_programs.csv \
--api \
--out reports/audit_capes6_api.csv
Current audit files:
- reports/audit_capes7_local.csv
- reports/audit_capes6_local.csv
- reports/audit_capes7_api.csv
- reports/audit_capes6_api.csv
Important audit results:
- CAPES 7 article details: 42,235 local rows, 67 selected programs, no missing or extra programs.
- CAPES 6 article details: 35,972 local rows, 81 selected programs, no missing or extra programs.
- CAPES 7 author files now match the API totals used in the audit: 3,172 rows for 2017, 4,351 for 2018, 13,367 for 2019, and 48,175 for 2020.
- CAPES 6 author files match API totals: 1,217 rows for 2017, 2,973 for 2018, 2,622 for 2019, and 35,665 for 2020.
- All filtered docente files have zero rows, and API audits also returned zero rows for those filtered resources.
One data-management issue was found and repaired during this audit: the older capes7_* filtered raw files did not cover all 67 current CAPES 7 programs. They were re-fetched from the official CAPES API. The repaired capes7_article_details_2017_2020_artpe.csv now matches the API total exactly.
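The kind of check that surfaces this problem can be sketched as follows. This is an illustrative outline, not the logic of scripts/audit_capes_downloads.py: the function audit_programs, the report keys, and the toy program codes are hypothetical.

```python
from collections import Counter


def audit_programs(selected_codes, local_rows, api_totals=None):
    """Compare local rows with the selected program list and optional API totals."""
    local_counts = Counter(row["CD_PROGRAMA_IES"] for row in local_rows)
    selected = set(selected_codes)
    report = {
        "local_rows": sum(local_counts.values()),
        # Selected programs with no local rows at all.
        "missing_programs": sorted(selected - set(local_counts)),
        # Programs present locally but not in the selected list.
        "extra_programs": sorted(set(local_counts) - selected),
    }
    if api_totals is not None:
        # Map each mismatching program to (local count, API count).
        report["count_mismatches"] = {
            code: (local_counts.get(code, 0), api_totals.get(code, 0))
            for code in sorted(selected)
            if local_counts.get(code, 0) != api_totals.get(code, 0)
        }
    return report


# Toy data with hypothetical program codes.
local = [
    {"CD_PROGRAMA_IES": "P1"},
    {"CD_PROGRAMA_IES": "P1"},
    {"CD_PROGRAMA_IES": "P2"},
    {"CD_PROGRAMA_IES": "P9"},
]
report = audit_programs(["P1", "P2", "P3"], local, {"P1": 2, "P2": 3, "P3": 0})
print(report)
```

In the toy data, P3 has zero rows both locally and in the API totals, so it appears under missing_programs but not under count_mismatches. That mirrors the case where a resource is empty in the API itself rather than incompletely downloaded.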
Data treatment
The CAPES 7 processed tables are built with:
python3 scripts/build_database.py
python3 scripts/export_summaries.py
Rscript analysis.R
scripts/build_database.py currently builds the main CAPES 7 analysis. It imports raw CAPES files from data/raw/capes_filtered/, imports data/manual/capes7_programs.csv, imports SJR files from data/raw/scimago/, and writes:
- data/processed/study.sqlite
- data/processed/articles.csv
- data/processed/professor_year.csv
- data/processed/program_year.csv
- data/processed/discipline_summary.csv
- data/processed/match_audit.csv
The main normalized article table is data/processed/articles.csv. Its key columns include:
- cd_programa_ies
- nm_programa_ies
- sg_entidade_ensino
- nm_entidade_ensino
- year
- production_id
- title
- language_raw
- doi
- issn_raw
- journal_vehicle_id
- issn_clean
- is_english
- area_avaliacao
- discipline_group
- focal_field
- sjr_quartile
- sjr_score
- journal_title_sjr
- journal_country_sjr
- journal_publisher_sjr
- is_q1
- is_q1_q2
Main transformations:
- Keep only article rows whose year is between 2017 and 2020.
- De-duplicate article rows at the program-year-production level: cd_programa_ies, year, production_id.
- Use DS_IDIOMA to create is_english.
- Clean DS_ISSN into issn_clean.
- Match issn_clean to SJR ISSNs.
- Keep non-indexed articles in the denominator; absence from SJR is treated as substantively informative, not random missingness.
- Create is_q1 and is_q1_q2 from SJR quartiles.
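Several of these steps can be sketched together on toy rows. This is a simplified outline, not the code in scripts/build_database.py: the function treat_articles, the lookup dictionary, and the quartile labels "Q1" and "Q2" are assumptions. The is_english rule is left out because its exact definition is not given here.

```python
def treat_articles(rows, sjr_quartile_by_issn):
    """Filter years, de-duplicate, match to SJR, and add quartile flags."""
    seen = set()
    treated = []
    for row in rows:
        # Keep only the 2017-2020 window.
        if not 2017 <= row["year"] <= 2020:
            continue
        # De-duplicate at the program-year-production level.
        key = (row["cd_programa_ies"], row["year"], row["production_id"])
        if key in seen:
            continue
        seen.add(key)
        # Non-indexed journals get None and still stay in the output.
        quartile = sjr_quartile_by_issn.get(row["issn_clean"])
        treated.append({
            **row,
            "sjr_quartile": quartile,
            "is_q1": int(quartile == "Q1"),
            "is_q1_q2": int(quartile in ("Q1", "Q2")),
        })
    return treated


# Toy rows with hypothetical program codes and ISSNs.
rows = [
    {"cd_programa_ies": "P1", "year": 2018, "production_id": 1, "issn_clean": "11111111"},
    {"cd_programa_ies": "P1", "year": 2018, "production_id": 1, "issn_clean": "11111111"},
    {"cd_programa_ies": "P2", "year": 2018, "production_id": 1, "issn_clean": "11111111"},
    {"cd_programa_ies": "P1", "year": 2016, "production_id": 2, "issn_clean": "22222222"},
    {"cd_programa_ies": "P1", "year": 2020, "production_id": 3, "issn_clean": "33333333"},
    {"cd_programa_ies": "P1", "year": 2019, "production_id": 4, "issn_clean": "22222222"},
]
sjr = {"11111111": "Q1", "22222222": "Q2"}

treated = treat_articles(rows, sjr)
q1q2_share = sum(r["is_q1_q2"] for r in treated) / len(treated)
print(len(treated), q1q2_share)  # 4 0.75
```

In the toy data, the same production reported by P1 and P2 survives once per program, and the non-indexed 2020 article remains in the denominator, which is why the Q1/Q2 share is 3 out of 4 rather than 3 out of 3.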
The observation unit for field summaries is a program-article record. If the same intellectual production is reported by two programs, it is counted once for each reporting program because the analysis compares program/area publication profiles.
Analysis outputs
analysis.R reads:
- data/processed/articles.csv
- data/processed/discipline_summary.csv
- data/processed/program_year.csv
- data/processed/match_audit.csv
It writes:
- data/processed/analysis_discipline_quality.csv
- data/processed/analysis_quartile_distribution.csv
- data/processed/analysis_program_quality.csv
- data/processed/analysis_focal_contrasts.csv
- data/processed/analysis_language_by_indexing.csv
- data/processed/analysis_english_mechanism.csv
- data/processed/analysis_discipline_language_mechanism.csv
- data/processed/analysis_sjr_country_summary.csv
- data/processed/analysis_logit_models.csv
It also writes figures to figures/, including:
- figures/pct_english_by_discipline.png
- figures/sjr_indexing_rate_by_discipline.png
- figures/sjr_quartile_distribution_by_discipline.png
- figures/q1q2_among_indexed_by_discipline.png
- figures/english_vs_q1q2_by_discipline.png
- figures/english_share_over_time.png
- figures/sjr_indexing_by_language_and_field.png
- figures/q1q2_by_language_and_field.png
The Quarto post ../index.qmd reads lattes/data/processed/articles.csv and lattes/data/manual/capes7_programs.csv directly, then computes display tables and figures inside the document.
CAPES 6 afterthought analysis
The CAPES 6 raw files are downloaded and audited, but the current processed tables are still CAPES 7 only. To add CAPES 6 as an afterthought section, the next clean step is to build a combined, score-coded analysis table rather than silently mixing CAPES 6 rows into the existing CAPES 7 table.
Recommended combined schema:
- Add capes_score to every article row.
- Add score_group, e.g. CAPES 6, CAPES 7.
- Preserve focal_field.
- Preserve discipline_group.
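One way to build such a table is sketched below, under the assumption that the CAPES 6 and CAPES 7 article rows are processed separately first and then stacked. The function combine_by_score and the toy rows are hypothetical.

```python
def combine_by_score(articles_by_score):
    """Stack per-score article tables and label every row with its score."""
    combined = []
    for score, rows in sorted(articles_by_score.items()):
        for row in rows:
            combined.append({
                **row,  # focal_field and discipline_group pass through unchanged
                "capes_score": score,
                "score_group": f"CAPES {score}",
            })
    return combined


# Toy rows; real rows would carry the full articles.csv schema.
capes7_rows = [{"production_id": 1, "focal_field": 1, "discipline_group": "A"}]
capes6_rows = [{"production_id": 2, "focal_field": 0, "discipline_group": "B"}]

combined = combine_by_score({7: capes7_rows, 6: capes6_rows})
print([row["score_group"] for row in combined])  # ['CAPES 6', 'CAPES 7']
```

Because every row carries an explicit score label, the CAPES 7-only analysis can always be recovered by filtering on capes_score.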
Recommended comparisons:
- Keep all current CAPES 7-only figures as the main analysis.
- Add a section titled something like “E se incluirmos programas CAPES 6?” (“What if we include CAPES 6 programs?”).
- In that section, show CAPES 6 and CAPES 7 separately or side by side.
- Avoid replacing the central CAPES 7 claim with a pooled CAPES 6/7 result.
Known limitations
- SJR quartiles currently use data/raw/scimago/scimagojr_sjr_best_quartile_2024.csv as a proxy. Historical 2017-2020 SJR files would be better.
- The filtered docente CAPES datastore resources return zero rows for the selected programs. This affects professor-denominator analyses.
- Author resources for 2017, 2018, and 2019 do not cover every selected program in the CAPES API itself. This is not a local download issue; API audits show the local files match official datastore totals. The 2020 author resource covers all selected CAPES 6 and CAPES 7 programs.
- The program list treats CAPES score as a selected score observed in the 2017-2020 program metadata files. For a publication, decide whether the score should be fixed at one reference year or treated year by year.
Suggested simplification for a future repository
For a public replication repository, simplify the workflow. A reader should not need to run many scripts or wait on slow CAPES API calls.
Recommended structure:
data/
  raw/
    capes_programs_2017_2020.csv
    capes_article_details_filtered_2017_2020.csv
    capes_authors_filtered_2017_2020.csv
    scimago_sjr_2024.csv
  processed/
    analysis_input.RData
    analysis_input.rds
scripts/
  reproduce_analysis.R
README.md
Recommended analysis_input.RData objects:
- programs: score-coded CAPES 6/7 program list.
- articles: normalized article table with SJR match columns.
- discipline_summary: summary by discipline and score.
- program_year: summary by program/year/score.
- audit_reports: compact versions of the local/API audit tables.
Recommended single script:
scripts/reproduce_analysis.R
That script should:
- Load analysis_input.RData.
- Recompute all tables used in the article.
- Recompute all figures.
- Optionally rebuild analysis_input.RData from raw CSVs if REBUILD=TRUE.
The slow CAPES API download/audit scripts should be preserved for provenance, but they should not be required for ordinary reproduction of the published analysis.
Copyright © Guilherme Duarte Garcia