ICAE Working Paper nº 2109
June, 2021

Gender Distribution across Topics in the Top 5 Economics Journals: A Machine Learning Approach

J. Ignacio Conde-Ruiz (Fedea, Universidad Complutense de Madrid and ICAE)
Juan-José Ganuza (Universitat Pompeu Fabra and Barcelona GSE)
Manu García (Washington University in St. Louis and ICAE)
Luis A. Puch (Universidad Complutense de Madrid and ICAE)

Abstract

We analyze all the articles published in the top five (T5) Economics journals between 2002 and 2019 in order to find gender differences in their research approach. We implement an unsupervised machine learning algorithm, the Structural Topic Model (STM), so as to incorporate gender document-level meta-data into a probabilistic text model. This algorithm characterizes jointly the set of latent topics that best fits our data (the set of abstracts) and how the documents/abstracts are allocated to each latent topic. Latent topics are mixtures over words where each word has a probability of belonging to a topic after controlling for journal name and publication year (the meta-data). Thus, the topics may capture research fields but also other more subtle characteristics related to the way in which the articles are written. We find that females are unevenly distributed along the estimated latent topics, using only data-driven methods. This finding relies on “automatically” generated built-in data given the contents of the abstracts of the articles in the T5 journals, without any arbitrary allocation of texts to particular categories (such as JEL codes or research areas).

Keywords: Machine Learning; Gender Gaps; Structural Topic Model; Gendered Language; Research Fields.
JEL Classification: I20, J16, Z13.

ISSN: 2341-2356
Collection website: https://www.ucm.es/icae/working-papers
Copyright © 2021 by ICAE. Working papers are in draft form and are distributed for discussion. They may not be reproduced without permission of the author/s.
Gender Distribution across Topics in the Top 5 Economics Journals: A Machine Learning Approach∗

J. Ignacio Conde-Ruiz,ᵃ,ᶜ Juan-José Ganuza,ᵇ Manu Garcíaᵈ and Luis A. Puchᶜ†

ᵃ Fedea
ᵇ Universitat Pompeu Fabra and Barcelona GSE
ᶜ Universidad Complutense de Madrid and ICAE
ᵈ Washington University in St. Louis and ICAE

June 2021

∗ We thank Antonio Cabrales, Pedro Delicado and Nagore Iriberri for helpful comments, and Elvira Alonso for excellent research assistance. We also thank the Editor and two anonymous referees for their suggestions, as well as session participants at Computing in Economics & Finance Conference, Tokyo (virtual) 2021.
José Ignacio Conde-Ruiz, and Manu García and Luis Puch, respectively, acknowledge the Spanish Ministry of Science and Innovation for financial support through projects PID2019-105499GB-I00 and PID2019-107161GB-C32. Juan-José Ganuza gratefully acknowledges the financial support from the Spanish Agencia Estatal de Investigación, through the Severo Ochoa Programme for Centres of Excellence in R&D (CEX2019-000915-S), and the Spanish Ministry of Education and Science through Project ECO2017-89240-P.

† Corresponding author: Juan-José Ganuza, Universitat Pompeu Fabra, Ramon Trias Fargas 27, 08005, Spain; E-mail: juanjo.ganuza@gmail.com

1 Introduction

Despite the efforts undertaken by the whole economics profession to fight discrimination, women are underrepresented in academia. Lundberg and Stearns (2019) assess the presence of female economists in the profession and report a very slow improvement over the last two decades. The picture is as follows. At the beginning of this century, 35% of PhD students and 30% of Assistant Professors were female. Since then, these numbers have not increased.¹ Additionally, Siniscalchi and Veronesi (2020), summarizing Chevalier (2019) (Report of the Committee on the Status of Women in the Economics Profession), point out that the proportion of women assistant professors in the “top 10” schools had declined to less than 20% by 2019. They also document that women have been less successful in being promoted to tenured associate or full professor.

In Economics, the tenure path often requires publishing in the top five (Top 5, or just T5) journals, namely: American Economic Review (AER), Econometrica (ECA), Journal of Political Economy (JPE), Quarterly Journal of Economics (QJE) and Review of Economic Studies (REStud). Heckman and Moktan (2020) analyze the tenure decisions of the top 35 Economics departments in the U.S.
and they conclude that T5 publications are a very powerful explanatory variable of promotion to tenure. Publishing in a T5 journal is becoming the main goal of young professors in Economics because their professional careers may depend on succeeding at this target. In addition, the content published in these journals also determines the path of research in Economics. As a consequence, the competition to publish in any of these journals has increased in recent years. Card and DellaVigna (2013) analyze the publication records in the Top 5 from 1970 to 2012, showing that the acceptance rate fell from 15% (1970) to 6% (2012). They explain this fact as a combination of an increasing number of submissions and a declining number of published papers. Card et al. (2019) further analyze the publication records from two of the T5 journals (the QJE and REStud), together with the Journal of the European Economic Association and the Review of Economics and Statistics. They report that the current proportion of accepted papers is 3%.

¹ Boustan and Langan (2019) analyze the performance of women across PhD programs in Economics. They report that in 2017 women were 32% of entering PhD students in economics. This proportion of women in economics is below that of many other fields, including science, technology, engineering, and mathematics (see also Bayer and Rouse (2016)).

Is the T5 entry barrier harder for women? The answer provided by Card et al. (2019) to this question is ambiguous. On the one hand, these authors do not find any gender bias in the refereeing process, and editors' decisions are gender-neutral conditional on the referees' advice. On the other hand, they find that, conditional on the referee process, female-authored papers end up accumulating more citations in later years.² A potential explanation for this second result is that journals hold female-authored papers to higher standards.
Hengel (2020) uses readability scores and finds that female-authored papers are better written, and that their writing improves during peer review and as the authors publish more papers. These results could be related to some “horizontal” features or characteristics of female-authored papers that lead to more citations or better writing standards, but not to higher acceptance rates in the editorial process. As Card et al. (2019) control for research fields (JEL codes), their results may be linked to more subtle “horizontal” differences; for example, that within the same research field males choose a more theoretical approach and females a more applied perspective (which tends to be more cited or subject to less complicated wording). We use a methodology that allows us to identify these subtle gender “horizontal” research differences in the form of word-level research topics.

Several papers have pointed out persistent gender differences in the choice of research fields in Economics. Dolado et al. (2012) analyze the gender distribution of research fields in the Top-50 Economics departments in 2005, and show that women are unevenly distributed across fields. Similarly, Chari and Goldsmith-Pinkham (2017) use data from submissions to the National Bureau of Economic Research Summer Institute (2001-2016), and show that the distribution of female researchers is not uniform across fields. From these, we learn that women are particularly underrepresented in macro, finance and economic theory, and more prevalent in labor or applied microeconomics. Beneito et al. (2018) find similar results using data from the annual AEA meetings from 2010-2016, while Lundberg and Stearns (2019) focus on PhD dissertations in Economics from 1991-2017 in almost all major PhD-granting departments in the United States. Using the JEL code to identify the research area, they find that women are more prone to study topics in Labor and Public Economics than in Macro and Finance.
They also show that this pattern has not changed over time.

² Hengel and Moon (2020) analyze publications in the T5 and also find that articles published by female authors are more cited.

In this paper, we want to contribute to this literature in two directions. First, we focus on exploring the gender “horizontal” distribution across research topics in the leading Economics journals. More importantly, we do so using a new methodological approach based on Machine Learning techniques, which classifies our database of abstracts into latent topics. We collect all the articles published in T5 journals for the period 2002-2019. We obtain 5,311 articles, and for each article we keep track of the authors' names, year of publication, journal and abstract. With this information, we can provide a very accurate picture of the performance of men and women in publishing in these leading journals. Our goal is to describe what these latent topics are and the distribution by gender across these topics.

Second, from the universe of algorithms for topic modelling we implement the Structural Topic Model (STM) developed by Roberts et al. (2019). We make this choice because the algorithm allows us to incorporate document-level meta-data into a probabilistic text model. Precisely, we keep track of journal names and publication years as covariates to improve the estimation of the prevalence of topics in our data. Our abstracts come from different sources and different periods of time, so it is natural to allow this meta-data to affect the frequency with which a topic appears. The output of the algorithm is a stochastic model that generates latent topics and allocates the documents to them in a probabilistic way. The main advantage of this unsupervised machine learning approach is that “latent topics” are mixtures over words where each word has a probability of belonging to the different topics.
Therefore, these topics can capture, conditional on covariates and without human intervention, research fields, information regarding the style of writing, methodology, conversational patterns or even different ways of thinking.

We start by identifying the number of latent topics for which the stochastic model best fits our data. Our main result is that women are unevenly distributed across latent topics. One key aspect is that the dispersion of female prevalence is higher across these topics. Moreover, we show that although the proportion of females is slightly increasing among the population of T5 authors over the years, the identified “horizontal” differences persist. We have computed the empirical distribution of latent topics by gender and we show some striking differences between male and female expected proportions. We want to emphasize the importance of these results, not only because latent topics may capture subtle “horizontal” differences, but also because the gender differences we estimate are “automatically” generated given the documents, without any arbitrary allocation to particular categories (such as JEL codes or declared research areas), and thus they are possibly more robust.

Notwithstanding, the choice of the number of latent topics, even if optimal as we discuss, is subject to clustering issues. Thus, we also choose to reduce the number of topics the algorithm has to generate in order to try to capture the mixtures of words that more closely relate to research areas. There is a trade-off when choosing ex ante the number of latent topics. On the one hand, a relatively high number of topics usually fits the data better. On the other hand, a lower number of latent topics facilitates their broad semantic interpretation. In our setting, a lower number of topics indeed turns out to bring them closer to traditional research fields.
Consistent with our main finding above, we also characterize an uneven distribution of topics/research fields by gender, very much in line with the existing literature cited above. However, here we can also discuss the link between the existing findings and our class of probabilistic results. In a nutshell, our approach provides evidence complementary to the previous literature on “horizontal” research differences between males and females. The estimated larger set of research topics may allow us to identify the gender gaps more precisely and, more importantly, may help us understand the driving forces behind these gaps.

There are several channels through which the gender differences in the choice of research topic that we identify in this paper can have an impact on the probability of publishing in top journals, earning tenure and, in general, on career success. Conde-Ruiz et al. (2017, 2021) and Siniscalchi and Veronesi (2020) provide two dynamic mechanisms that may explain how “horizontal” gender differences, together with an initially uneven gender distribution of researchers, may generate an unintentional discrimination trap linked with the functioning of academic organizations (journals, departments, etc.). Conde-Ruiz et al. (2017, 2021) analyze a promotion setting in which workers' skills are assessed by committees whose members have different abilities to evaluate workers' signals (they are better at evaluating workers from the same group). This “homo-accuracy” assumption naturally translates to the present academic setting, where promotions and editorial processes are run by “committees” and where evaluators doing research in the same field are better able to assess the underlying quality of the candidate. Under this “homo-accuracy bias”, the group that is most represented in the evaluation committee generates more accurate signals and, consequently, has a greater incentive to invest in human capital.
This gives rise to a discrimination trap. If, for some exogenous reason, one group is initially poorly evaluated (less represented in evaluation committees), this translates into lower investment in human capital by individuals of that group, which leads to lower representation in the evaluation committee in the future, generating a persistent discrimination process. Siniscalchi and Veronesi (2020) focus specifically on the academic labor market and point out a similar unintentional discrimination trap linked to the so-called “self-image bias”: evaluators are biased towards young researchers whose characteristics are similar to their own. The authors build an overlapping-generations model with two groups of researchers with equally desirable (but slightly different) research characteristics and identical ex-ante productivity distributions. If one group is slightly over-represented among the evaluators, this group (and its specific research characteristics) may dominate forever. These theoretical results are in line with the empirical findings of Dolado et al. (2012), who show that the probability that a female researcher works in a given field is positively related to the share of women already working in that field (path dependence). The proportions these authors find based on JEL codes are very similar to what we find automatically at the same level of aggregation, but we can set forth a lot more field idiosyncrasy. At the end of the paper we discuss various issues for further research in related applications.

The paper is organized as follows: the next section presents the raw data and the descriptive analysis of the patterns of publication in T5 journals. Section 3 presents the Structural Topic Model. Section 4 studies the gender differences in the latent estimated topics. Section 5 extends the model to analyze topics as research fields.
The last section concludes, and in the Appendix we explore several extensions and provide details about the functioning of the Structural Topic Model (STM) algorithm.

2 Raw Data and Descriptive Analysis

We collect the publicly available information from all articles published between 2002 and 2019 in the T5 leading journals in economics, as already indicated: The American Economic Review, Econometrica, The Journal of Political Economy, The Quarterly Journal of Economics, and The Review of Economic Studies. For each article we collect the information about the journal, year of publication, authors and the abstract of the paper. We have 5,311 articles in total over the period 2002-2019. The average number of papers published in Top-5 journals per year is 295, with a maximum of 351 (in 2017) and a minimum of 234 (in 2002).

Figure 1: Number of Articles Published per Year in T5. Note: Publications exclude notes (without abstract), comments, announcements, and Papers and Proceedings (P&P).

Figure 1 shows that the distribution of published papers by journal is uneven. AER accounts for 34.3% while JPE only represents 13.4% of the sample. AER publishes regular articles as well as shorter papers.³ We include in our sample the shorter papers (as long as they have an abstract) since their editorial process is similar to that of regular articles. We exclude the articles published in AER as Papers and Proceedings since their requirements and editorial processes are different.⁴

³ AER stopped publishing shorter papers in 2018.
⁴ In Appendix E we add P&P articles to our data and we replicate the analysis for these extended data.

We want to compare this descriptive information with Card and DellaVigna (2013), who analyze all the articles published in the T5 from 1970 to 2012. They obtain several interesting facts, among them that the total number of articles published in these journals declined from 400 per year in the late 1970s to 300 per year in 2012. They also show that one journal, the American Economic Review, accounted in 2012 for 40% of T5 publications, up from 25% in the 1970s. In our updated sample, as shown in the figure, we find that this trend has stabilized after 2012. Card and DellaVigna (2013) also find that the number of authors per paper increased from 1.3 in 1970 to 2.3 in 2012. We observe the same trend in recent years; in particular, in 2019 the average number of authors was above 2.5.

Figure 2: Number of Authors of Published Papers in T5.

Figure 2 reports the share of articles by number of authors, one to five or more. Clearly the steepest downward trend is for solo authorship, whereas the three-author case (or even the four-author case) exhibits the opposite pattern. The share of the two-author case has remained fairly stable over the entire sample at around 40% of articles (base, not augmented). Five or more authors on Economics articles in leading journals is still a rare event.

Next we move to analyze gender issues. We do not directly observe gender in our data. To solve that problem, we classify authors by gender according to their first name. We rely on three different databases: the first-names database published by the U.S. Social Security Administration, created using data from Social Security card applications; the database constructed by Tang et al. (2011), who use Facebook to collect data on first names and self-reported gender; and finally, the names database developed by Bagues and Campa (2017). We manually check any author who (a) falls within the [0.05, 0.95] probability interval of being male/female or (b) cannot be found in any of the databases.

We convert the original sample of articles into an articles-authors sample. We transform the original 5,311 articles into a total sample of 11,721 observations (with an implied 9,840 article-male author observations and 1,881 article-female author observations). Unless otherwise indicated, all measures below are computed over this augmented articles-authors sample.

Figure 3: Number of article-author observations by gender, and the share of female articles.

Figure 3 depicts the share of female authors (right axis), which has been steadily increasing (with fluctuations) at a rate of 6.2% per year (compared with men's average rate of 3.7%), reaching a 20% share during a couple of years in the recent past. Although female authors are increasing at a higher rate, and there has been an important improvement in the last decades, women are clearly under-represented in T5 publications. These data are consistent with the data from the report of the Committee on the Status of Women in the Economics Profession, Chevalier (2020). Figure 4 compares the evolution of the share of women in the different professor categories of the top 20 Schools of Economics in the United States in 2020 with the proportion of female authors in the Top 5. Notice that the share of female authors is very similar to the 20.4% share of women in the faculty of the top 20 Schools in the United States in 2020. In line with Heckman and Moktan (2020), the rate of increase of female coauthors in T5 seems to be very similar to the rate of increase of female full Professors in these Departments. The average proportions of females who are full professors in Spain and in the EU are very similar.⁵

⁵ See Auriol, Friebel and Wilhelm (2019).

Figure 4: The Pipeline for Top 20 Economics Departments: Percent and Numbers of Faculty and Students who are Women. Source: CSWEP Report, 2020 and own elaboration.

We have split the description of the data into two figures, one for single-gender groups and another for mixed teams. Figure 5(a) shows the corresponding co-authorship pattern when the set of co-authors is a single-gender group.
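To make the data construction behind these figures concrete, the following is a minimal Python sketch of the three steps described in this section: classifying a first name, flagging ambiguous or unknown names for a manual check, and expanding articles into article-author observations that can then be grouped into single-gender and mixed teams. The name probabilities, the article identifiers and all function names are hypothetical illustrations; the paper relies on three name databases that are not reproduced here.

```python
# Hypothetical probabilities that a first name is female. In the paper these
# come from three name databases, not from this toy dictionary.
P_FEMALE = {"anna": 0.99, "john": 0.01, "maria": 0.98, "luis": 0.02, "robin": 0.40}

LOW, HIGH = 0.05, 0.95  # names with a probability inside [LOW, HIGH] are ambiguous


def classify(first_name):
    """Return 'F', 'M', or 'manual' when the name needs a manual check."""
    p = P_FEMALE.get(first_name.lower())
    if p is None or LOW <= p <= HIGH:
        return "manual"  # not found in the database, or ambiguous
    return "F" if p > HIGH else "M"


def article_author_rows(articles):
    """Expand {article_id: [first names]} into one row per article-author pair."""
    return [(art, name, classify(name)) for art, names in articles.items() for name in names]


def team_type(genders):
    """Label a team as single-gender or mixed; manual cases stay unresolved."""
    kinds = set(genders)
    if "manual" in kinds:
        return "unresolved"
    return "single-gender" if len(kinds) == 1 else "mixed"


# Toy sample of four articles (assumption, for illustration only).
articles = {"a1": ["John"], "a2": ["Anna", "Luis"], "a3": ["Maria", "Anna"], "a4": ["Robin", "John"]}
rows = article_author_rows(articles)
female_share = sum(g == "F" for _, _, g in rows) / len(rows)
teams = {art: team_type([g for a, _, g in rows if a == art]) for art in articles}
```

In this toy sample the four articles expand into seven article-author rows, in the same way that the 5,311 articles of the paper expand into 11,721 observations. The article with an ambiguous first name is left unresolved until it is checked by hand.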
The most salient feature of these data is that, while the share of sole male authors has been declining from 30% of the total to slightly above 10%, the share of sole female articles has been stable over the entire sample, at close to 5%. We also want to point out that, despite the slow decline, two males is the most common co-author team. The share of articles with equal numbers of male and female authors has been fairly stable at about 12% (92.7% of these articles are, in particular, one male-one female). Alternatively, the share of articles with at least one woman and at least two men has been increasing from nearly 5% of the total to around 14%. Thus, the strongest trend in the data seems to be associated with the participation of female authors in articles with more male authors.

Figure 5: Co-authorship patterns in T5 journals. (a) Percentage of T5 articles coauthored by single gender teams. (b) Percentage of T5 articles coauthored by mixed gender teams.

Figure 6: Distribution of number of T5 papers published by gender.

Figure 6 shows the distribution of the number of published papers by gender. Conditioning on having published in T5 journals, females are more likely than males to publish only one or two papers, while the proportion of authors that have published more than three papers is greater for males than for females. Clearly though, more than 80% of either female (15% of the distribution) or male authors have published fewer than two T5 papers over the last 20 years. This is an important fact for understanding the role of superstars in the profession as well as the formation of networks of coauthors.

3 The Empirical Model: Structural Topic Model (STM)

Our empirical strategy is to use unsupervised machine learning techniques to uncover the hidden structure of our text documents.⁶ By unsupervised we denote the absence of human intervention in identifying the latent topics behind the abstracts of articles published in the T5 journals during the period 2002-2019.
For us, an abstract is a set of words and these words have different probabilities of belonging to one or several latent topics. Informally, when we are writing on a particular topic there are words that are used more often than others. Our objective is to provide a low-dimensional representation (topics) of a high-dimensional object (abstracts) while retaining as much as possible of its informational content.

⁶ For an excellent non-technical introduction to machine learning, see Hansen et al. (2017).

The baseline for topic modelling is the LDA algorithm (Latent Dirichlet Allocation) developed by Blei et al. (2003), which is also the most popular machine learning algorithm for reducing the dimensionality of text documents.⁷ In this paper, we use an algorithm called STM (Structural Topic Model) developed by Roberts et al. (2019), which can be understood as a refinement of the LDA algorithm. This topic model is said to be structural because it allows the use of “covariates” to inform about the structure (partial pooling of parameters). These covariates in our case are the different journal names and the different years in the sample. The idea is to better capture along these dimensions the changing relationship between words in abstracts and the latent topics. Next we explain the algorithm and the outcome variables, and in Appendix A we provide a more technical discussion of STM and LDA.

We start by describing the inputs. We extract all the words from our 5,311 abstracts (or documents). First, we have to “clean” this set of words in order to reduce the vocabulary and select terms with more informational content. This helps us obtain a better estimation of more semantically meaningful topics. The corpus is the set of unique words that we obtain after converting the original raw text to lower case and removing common stop-words,⁸ such as “for” or “in”.
Also, we prune the words until we get their original linguistic root (“educ” instead of “education”), and eliminate the words that appear only one or two times.⁹ In our case, we start with a set of 13,835 different terms and end up with a corpus of 4,241 unique words.

The second step is to represent our text data in a document-term matrix of $D$ rows (5,311 abstracts) and $V$ columns (4,182 unique words in our corpus), where the element $(d, v)$ of the matrix is the number of times the $v$th unique word appears in the $d$th abstract. This document-term matrix, which reduces the dimensionality of our original text variables, is the input of the algorithm. Our objective is to find a probabilistic topic model that is able to explain the document-term matrix in two additional steps: first by identifying $K$ topics in our corpus, and then by representing documents as a combination of those topics.

⁷ For a technical description of the LDA algorithm, see the original article of Blei et al. (2003) and also Hansen et al. (2017), which is the first paper that uses the LDA algorithm in the economic literature.
⁸ In particular, we remove the stop-words from the SMART list, developed at Cornell University in the 1960s.
⁹ See Appendix B for the details of this pre-processing.

What is a topic? Topic $k$ is a probability distribution $\beta_k$ over all the unique words of our corpus, where $\beta_k^v$ is the probability that topic $k$ generates word $v$. Each document $d$ has its own distribution over the set of topics, $\theta_d$. This captures that each document/abstract can refer to several topics. Then, $\theta_d^k$ is the weight of topic $k$ in document $d$. The probabilistic topic model is described by these topic ($\beta_k$) and document ($\theta_d$) distributions. Given these, the probability that an arbitrary word in document $d$ coincides with the $v$th term is $p_{d,v} = \sum_k \beta_k^v \theta_d^k$.
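The pipeline described so far (lower-casing, stop-word removal, stemming, pruning of rare terms, the document-term matrix, and the mixture probability of a word given a document's topic weights) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word set is a tiny stand-in for the SMART list, `crude_stem` is a hypothetical suffix stripper rather than the stemmer used in the paper, and the toy documents and the values of `beta` and `theta_d` are invented.

```python
from collections import Counter

# Tiny stand-in for the SMART stop-word list (assumption, for illustration).
STOP_WORDS = {"for", "in", "the", "of", "and", "a", "on"}


def crude_stem(word):
    """Very rough stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ation", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lower-case, drop punctuation and stop-words, then stem."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    return [crude_stem(w) for w in words if w and w not in STOP_WORDS]


def document_term_matrix(docs, min_count=3):
    """Rows are documents; columns are stems occurring at least min_count times overall."""
    tokens = [tokenize(d) for d in docs]
    totals = Counter(t for doc in tokens for t in doc)
    vocab = sorted(t for t, c in totals.items() if c >= min_count)
    counts = [Counter(doc) for doc in tokens]
    return vocab, [[c[v] for v in vocab] for c in counts]


def word_probabilities(beta, theta_d):
    """Mixture probability of each word: sum over topics k of beta[k][v] * theta_d[k]."""
    n_words = len(beta[0])
    return [sum(beta[k][v] * theta_d[k] for k in range(len(beta))) for v in range(n_words)]


docs = [
    "Education and wages in the labor market.",
    "Wages for education: the labor supply.",
    "Trade and wages.",
]
# min_count=2 only because the toy corpus is tiny; the paper drops words
# that appear one or two times, which corresponds to min_count=3.
vocab, dtm = document_term_matrix(docs, min_count=2)

beta = [[0.5, 0.4, 0.1], [0.1, 0.1, 0.8]]  # two hypothetical topics over the three stems
theta_d = [0.25, 0.75]                     # hypothetical topic weights of one document
p_d = word_probabilities(beta, theta_d)
```

Because each row of `beta` and the vector `theta_d` sum to one, the resulting word probabilities also sum to one, which is what allows them to be used as the building block of a likelihood for the observed counts.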
Using these probabilities, we can obtain the total likelihood of our data, $\prod_d \prod_v p_{d,v}^{n_{d,v}}$, where $n_{d,v}$ corresponds to the elements of the document-term matrix (the number of times the $v$th unique word appears in the $d$th abstract). This total likelihood is our “objective” function. In a nutshell, the LDA and STM algorithms are designed to find numerically the stochastic model of latent topics (the distributions $\beta_k$ and $\theta_d$) that best suits our document-term matrix, that is, the one that maximizes this total likelihood. We skip here further details on the algorithms we use, and refer the interested reader to Appendix A (and also to Roberts et al. (2014)). However, we want to make two important observations.

First, as indicated above, we implement STM instead of LDA. The main advantage of STM for our data is that we can use very relevant covariate information about our documents in order to improve parameter estimation.¹⁰ In particular, for each document/abstract we interact the year of publication as well as the journal name. We take advantage of the variability of the abstracts over time and across journals to improve the estimation of our stochastic model (in particular, of the distribution $\theta_d$).

The second important observation refers to the determination of the number of topics. We can follow two strategies. One is to find the number of topics that best fits the data, which usually leads to a large (optimal) $K$. The alternative is to force the algorithm to use a given number of topics in order to facilitate their interpretation. For our baseline analysis we use the first approach and we work with 54 topics, but we also pursue the estimation of our stochastic model using a fixed number of topics to facilitate comparison with the results in the existing literature.

⁹ See Hansen et al. (2018) for a precise description of the computation of the total likelihood.
¹⁰ In Cabrales et al. (2018) there is an attempt to also impute gender as an additional covariate for the articles published in the British press, by looking for female names in the body text of these articles.

Previous literature, using JEL codes (for example, in Card et al. (2019)) or research areas in top departments (for example, in Dolado et al. (2012)), has concentrated on a broad definition of topics as fields of research, say, Labor or Econometrics. However, the unsupervised learning methodology we use allows us to go beyond pre-labelled research areas so as to capture more subtle differences, such as writing style, particular methodologies, or the variation in research questions. For example, when identifying latent topics, our methodology allows us to separate two papers in labor economics, one more applied and the other with a theoretical contribution. We consider our approach a promising tool to analyze whether there are horizontal gender differences in economics research, that is, whether or not males and females write different articles even within the same research field. For this reason, in the next section we analyze our stochastic model with $K = 54$ topics, while in Section 5 we focus on estimating our stochastic model with $K = 15$ topics.

In addition to these two exercises, in the appendix we extend our original sample to include the abstracts of 1,117 articles published as Papers and Proceedings in AER between 2011 and 2018 (before 2011 these types of papers do not have abstracts, and after 2018 they are published in a different journal). We will show that for this extended sample the optimal number of topics increases to $K = 70$.
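Two ingredients of this section lend themselves to a compact illustration: the total likelihood that serves as the objective function, and the way document-level covariates can shift the expected topic weights of a document. The Python sketch below is not the estimation procedure of the paper. `log_likelihood` simply evaluates the logarithm of the total likelihood for given distributions, and `expected_prevalence` is a simplified stand-in for the logistic-normal prevalence model of STM in which the random component is omitted. The covariate coding and the coefficient values in `gamma` are hypothetical.

```python
import math


def log_likelihood(dtm, beta, theta):
    """Sum over d, v of n[d][v] * log p[d][v], with p[d][v] = sum_k beta[k][v] * theta[d][k]."""
    total = 0.0
    for d, row in enumerate(dtm):
        for v, n in enumerate(row):
            if n:  # words that do not appear contribute nothing
                p = sum(beta[k][v] * theta[d][k] for k in range(len(beta)))
                total += n * math.log(p)
    return total


def expected_prevalence(x_d, gamma):
    """Softmax of the linear index x_d . gamma_k for each topic k.

    Simplified stand-in for STM's logistic-normal prevalence model:
    the document-specific noise term is left out (assumption).
    """
    scores = [sum(x * g for x, g in zip(x_d, gamma_k)) for gamma_k in gamma]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [w / norm for w in weights]


# Toy data: two documents, two words, two topics (all values hypothetical).
dtm = [[2, 0], [0, 3]]
beta = [[0.9, 0.1], [0.2, 0.8]]
theta_sharp = [[1.0, 0.0], [0.0, 1.0]]  # each document loads on one topic
theta_flat = [[0.5, 0.5], [0.5, 0.5]]   # uninformative topic weights

ll_sharp = log_likelihood(dtm, beta, theta_sharp)
ll_flat = log_likelihood(dtm, beta, theta_flat)

# Hypothetical covariates: intercept, AER dummy, years since 2002.
x_d = [1.0, 1.0, 5.0]
gamma = [[0.0, 0.0, 0.0], [0.2, -0.5, 0.1]]
theta_expected = expected_prevalence(x_d, gamma)
```

In the toy example the topic weights that match the word counts yield a higher log-likelihood than the uninformative ones, which is the sense in which the algorithm searches for the distributions that best suit the document-term matrix. Any comparison across different numbers of topics would need additional care (for instance held-out data), which this sketch does not attempt.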
While we have preferred to exclude these papers from the main baseline analysis because they are very short papers with very different editorial processes than regular submissions, this extended sample generates interesting new insights.

4 Gender Differences in Latent Estimated Topics

As we said above, the number of topics that best fits the text data is 54.¹¹ We estimate probabilities for each document to belong to this set of built-in latent topics using the Structural Topic Model. The STM output is summarized by the latent topics displayed in Figure 7, which shows the key words associated with each of the 54 topics. The words within each row are ordered left to right by the probability that they appear in each latent topic. Eventually, we could assign labels to latent topics based on well-known field names in Economics. For instance, we can associate the most prevalent topic in the sample in expectation, topic 28, with international trade. Likewise, the second most prevalent topic in the distribution, topic 9, may be associated with Econometric Theory. However, this is not the goal of the analysis, as we have indicated above. The important thing is that latent topics may be related to something beyond research fields, such as methodology or style of writing. These latent characteristics hide gender differences too.

¹¹ In Appendix C we provide a formal discussion about the optimal number of topics.

4.1 Topic Prevalence

Once we have identified the estimated latent topics, we can analyze how our documents/abstracts are distributed among them. In allocating an abstract to a particular topic we consider our underlying $\theta_d$ distribution. Then we assign document $d$ to different topics with different probability weights.
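To make these objects concrete, the following minimal Python sketch computes word probabilities, the total log-likelihood and the expected number of documents per topic. It is an illustration only and not the estimation code used in the paper. It assumes the standard LDA mixture form $p_{d,v} = \sum_k \theta_{d,k} \beta_{k,v}$, and all numbers and function names (`word_probabilities`, `log_likelihood`, `expected_documents_per_topic`) are hypothetical.

```python
import math

# Hypothetical toy inputs: 3 abstracts, 2 latent topics, 4 word stems.
theta = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]          # theta[d][k]: weight of topic k in abstract d
beta = [[0.4, 0.3, 0.2, 0.1],                          # beta[k][v]: probability of stem v in topic k
        [0.1, 0.1, 0.3, 0.5]]
counts = [[3, 2, 1, 0], [1, 1, 2, 2], [0, 1, 2, 4]]    # n[d][v]: document-term matrix


def word_probabilities(theta, beta):
    """Assumed mixture form: p[d][v] = sum_k theta[d][k] * beta[k][v]."""
    n_topics, n_terms = len(beta), len(beta[0])
    return [[sum(row[k] * beta[k][v] for k in range(n_topics))
             for v in range(n_terms)]
            for row in theta]


def log_likelihood(counts, theta, beta):
    """Log of prod_d prod_v p[d][v] ** n[d][v], skipping zero counts."""
    p = word_probabilities(theta, beta)
    return sum(n * math.log(p[d][v])
               for d, row in enumerate(counts)
               for v, n in enumerate(row) if n > 0)


def expected_documents_per_topic(theta):
    """Column sums of theta: each abstract is spread over topics by its weights."""
    return [sum(row[k] for row in theta) for k in range(len(theta[0]))]


p = word_probabilities(theta, beta)
total_loglik = log_likelihood(counts, theta, beta)
topic_sizes = expected_documents_per_topic(theta)
```

In this sketch the topic sizes add up to the number of abstracts, which is the sense in which a document is assigned to several topics with probability weights rather than to a single category.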
Following this approach, Figure 8 shows the latent estimated topics in a way that also illustrates the number of documents in each topic: the size of each circle is proportional to the expected number of documents in the topic (we have also reproduced this information numerically in a column of Figure 7). As we cannot map our 54 topics to particular fields of research, it is difficult to interpret the information in Figure 8 regarding the size of the topics. For example, topics 11, 9 and 21 in Figure 8 are related to “Econometric Theory” and are relatively large compared with other topics. However, if the algorithm had introduced more topics within “Econometric Theory”, each topic would have had a smaller mass, with the weight of the research field being the same. In other words, our perception of the successful topics is affected by how the research field is split into topics.

Figure 8 also contains information on the connectedness between topics. For example, if latent topic $k$ is closer to $k'$ than to $k''$, it means that the distribution $\beta_k$ is more similar to the distribution $\beta_{k'}$ than to the distribution $\beta_{k''}$. Looking at Figure 7 and the description of the latent topics in Figure 8, some interesting patterns arise. For example, the previously discussed topics 11, 9 and 21 (“Econometric Theory”) are somewhat isolated from the rest of the topics.
Figure 7: Optimal K Topics Ranked by Prevalence in the corpus. (For each of the 54 topics, the figure lists the most probable word stems with their word prevalence (%), the topic proportion (%) and the female proportion (%); white = median female proportion.)

Figure 8: Connectedness between topics and the fraction of documents/abstracts in each topic ($\theta_d$ distribution).

In Figure 8 we can also identify some other clusters of topics. For example, topics 51, 34, 23, 2, etc. (East in Figure 8) are related to Macro-Finance, closer to those in Econometric Theory, but not that much; topic 50 (West in Figure 8) is a central node of a set of topics related to Political Economy and Institutions; and topics 29, 32, 22, etc. (South-West in Figure 8) are related to Microeconomics (contract theory, decision theory, etc.). Finally, applied areas such as labor, international-development, or public economics are located around topics 19, 49, 28, and 48 (North in Figure 8). In Appendix D we undertake a more formal analysis of the distance between topics using a Simple Correspondence Analysis of the probability matrix for documents to belong to the different latent topics.
We find the corpus organized along two dimensions: Dimension 1 can be interpreted as going from Applied to Theory, whereas Dimension 2 goes from, say, Economics to Econometrics.

Figure 9: Connectedness between topics and the female authors of documents/abstracts in each topic.

Using our classification of authors’ names by gender and the allocation of documents to latent topics, we can build a similar figure with information about the gender distribution. Figure 9 shows the latent topics, where the sizes of the circles are proportional to the percentage of female authors working in each topic (we have also reproduced this information numerically in the last column of Figure 7).

Figure 9 provides interesting evidence of the main message of this paper: males and females display different patterns when doing research. Independently of the degree of underrepresentation of women in the profession, if there were no significant horizontal gender differences, we would expect the sizes of the latent topics, measured by the proportion of females, to be similar. On the contrary, we observe an uneven distribution of sizes. There is a small subset of topics (North in Figure 9), especially topic 49, with a relatively high proportion of females, which moreover seem to be closely connected (around the terminology of applied economics fields). On the contrary, there is another set of topics (for example, South-West in Figure 9) that are also closely connected and where the presence of females is scarce (around terms common to economic theory research questions).

4.2 Topic analysis and the gender distribution

As we said above, it is difficult to describe the precise semantic meaning of the latent topics when we are working with K = 54. We are able, however, to look more closely at the latent topics where females are more or less prevalent and at their potential implications.
In particular, Figure 10 shows that the latent topic with the highest proportion of female authors is topic 49 (32.8%, as indicated in Figure 7). On the contrary, topic 16 turns out to be the topic with the lowest proportion of females (10.1%, as indicated in Figure 7). As a simple illustration, Figure 10 represents these topics as word clouds, where the size of each term in the cloud reflects its probability in the latent topic distribution $\beta_k$.

Figure 10: Topic Word Clouds: Topic 49 vs Topic 16. (a) Topic 49 (highest prop. of female authors). (b) Topic 16 (lowest prop. of female authors).

The words that seem most prominent in the cloud for topic 49 are women, men, parent, children, health, etc. These words can easily be linked to research fields, such as gender or health economics, traditionally associated with women. Similarly, the word cloud of topic 16 seems to be related to Micro theory, which has often been labeled (although not statistically) as an area with fewer females than average.

Latent topics may differ in other dimensions besides semantic content. For instance, Hengel (2020) uses readability scores to measure the quality of writing of article abstracts.¹² We have used E. Hengel’s Python module Textatistic to compute readability scores for the article abstracts across our latent topics. The finding is that scores for more female topics are better than for more male topics. However, it is hard to disentangle the role of the prevalence of female authors from that of the wording within a topic. Moreover, scores that are outliers should be properly treated to ease comparisons. We leave the study of these readability issues, which may imply fundamental gender differences, for further research.

Rather, Figure 11 shows the mean of the presence of women authors by topic, together with the standard deviation of this presence over the sample of years.
For some latent topics the proportion of females is larger than the average (which is 15.9% over the period 2002-2019), reaching 33% for topic 49. On the contrary, females are especially underrepresented in other topics, such as topic 16, with only 10%. Dispersion over time also differs across topics, and it seems to be higher for topics with a higher proportion of females (the correlation between dispersion and the proportion of females is 0.35). While it is true that the proportion of female authors has been increasing over the last two decades, from around 13% in 2002 to 19% in 2019, we do not see a trend in the dispersion of the proportion of females by topic. Consequently, we see the prevalence of females across topics as a signal of “horizontal” gender differences in research.

Nevertheless, to have a more accurate picture of these “horizontal” differences, we need to add the information regarding the relative prevalence of the topics. It could be that females are underrepresented in a particular topic but that this has little impact because the topic contains very few published papers.

¹² As E. Hengel discusses in detail, abstract readability is strongly positively correlated with the readability of other sections of a paper.

Figure 11: On the presence of women, by topic: mean and one standard deviation across time.

Figure 12 shows the distribution of males and females across topics, normalized to have the same size. This gives us the propensity that, say, a female-authored paper belongs to any of the 54 topics. We rank the topics according to the probability of being chosen by a male author. This figure provides evidence that male and female authors either have different preferences or follow different strategies when pursuing and publishing their research. We observe that topics with higher “demand” by males are also highly demanded by females.
However, there is a set of topics for which the proportion of published papers by men is high and which are less attractive (or more difficult to publish in) for females. In general, the male and female distributions are different, with the salient feature of topic 49, which is a clear spike in the female distribution of published papers.

Figure 12: Empirical distributions across topics between males and females (conditional on having published an article in Top 5).

We confirm this evidence with the complementary Figure 13, which represents the dispersion of published female-authored papers across topics while also accounting for the prevalence of latent topics. In particular, for each topic we take the proportion of published papers by female authors (from Figure 12) minus the proportion of published papers in this topic overall. If, conditional on having published a paper, males and females were equally likely to publish a paper in a specific topic, this difference would be zero. Then, we can interpret this difference as the excess propensity of females to publish a paper in a particular topic. These differences can be positive or negative, and their sum over all topics is zero. The figure shows that there are topics for which the propensity of females to publish papers is higher than that of males, and vice versa. Again topic 49, but also topics 41 (health) and 30 (applied IO), are on one side, while theory topics such as 16 or 37 are on the other side.

In order to analyze the pattern of co-authorship, we have pooled the articles into three groups: papers written by male authors, by female authors, and by gender-mixed teams of authors. The main results are summarized in Figure 14, which shows that there is an important difference in the pattern of latent topics between male-only teams and female-only teams, while mixed teams generate an intermediate distribution over the latent topics.
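The excess propensity described above (the female distribution over topics minus the overall distribution) can be sketched in a few lines of Python. This is a hedged illustration under assumptions of ours: each paper is weighted by its share of female authors, the toy numbers are invented, and `topic_distribution` is a hypothetical helper, not the authors' code.

```python
def topic_distribution(theta, weights):
    """Weighted average of the rows of theta; the result sums to one."""
    total = sum(weights)
    n_topics = len(theta[0])
    return [sum(w * row[k] for w, row in zip(weights, theta)) / total
            for k in range(n_topics)]


# Hypothetical toy data: 4 papers, 3 latent topics.
theta = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.2, 0.2, 0.6],
         [0.6, 0.3, 0.1]]
female_share = [0.0, 1.0, 0.5, 0.0]   # assumed share of female authors per paper

female_dist = topic_distribution(theta, female_share)        # distribution of female-authored work
overall_dist = topic_distribution(theta, [1.0] * len(theta))  # distribution of all papers

# Excess propensity of females by topic: positive values mark topics where
# female-authored work is over-represented relative to the corpus.
excess = [f - o for f, o in zip(female_dist, overall_dist)]
```

Because both distributions sum to one, the entries of `excess` sum to zero by construction, which is the property noted in the text.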
Finally, we want to address a related but different question: how males and females diversify across topics. For example, when writing an article, an author may contribute to a single latent topic or to several; authors who have published several papers may have written similar articles or may have been more diverse. Are these diversification patterns different for males and females?

To address this question, the first step is to choose a measure of latent topic dispersion/concentration. A natural candidate is the Herfindahl-Hirschman Index (HHI), which is used to measure concentration in a market. The HHI is calculated by squaring the market share of each firm (the topic) competing in a single market and then summing the resulting numbers: $\mathrm{HHI} = \sum_{i=1}^{N} s_i^2$. We apply this index to our problem as follows. For each author (the market), we identify all the latent topics that she has contributed to (the firms). For each article, the algorithm computes a probability distribution over the latent topics. We repeat the process for all articles by the same author. Then, the cumulative probability divided by the number of articles is the contribution of the author to this particular latent topic (the market share, $s_i$). For example, if an author publishes very similar papers related to a single or a few latent topics, her HHI will be high. On the contrary, authors with a more diverse research agenda will have a lower HHI.

Figure 13: Relative propensity of publishing papers by females over topics.

Figure 14: Empirical distributions across topics between males, females and mixed authorship (conditional on having published an article in Top 5).

Figure 15: Diversification across latent topics by gender (HHI).

Figure 15 shows the corresponding average HHI for males and females. We have computed the HHI controlling for the number of papers by author. It is clear that an author who has published more papers is likely to have contributed to a larger set of latent topics and therefore to have a lower HHI. Interestingly, the figure shows some differences between genders in terms of diversification. Females are more diverse (lower HHI) when publishing one or two papers, but less diverse (higher HHI) when publishing a larger number of papers in the Top 5.¹³

¹³ The HHI is a first approximation as a measure of research diversification. In the future, we want to improve the measure by taking into account that some latent topics are close to others.

5 Topics as Research Fields

In this section we estimate the stochastic model with a lower number of topics, with two objectives. On the one hand, a low K facilitates the semantic interpretation of topics and thus the analysis of, for instance, whether or not the weight of a particular field in the T5 has increased over time. On the other hand, a low number of topics allows us to frame our results within the previous literature, which has used a small number of categories linked to JEL codes and research areas in top departments. After estimating the model for a range of $K \in \{10, \ldots, 20\}$, we have found that K = 15 is the number of topics for which the estimated model performs best in terms of both fit to the data and the semantic content of the latent topics. The model with K = 15 latent topics is summarized in Figure 16.
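As a brief aside on the diversification measure of Section 4.2, the author-level HHI can be sketched as follows. The snippet is only an illustration of the formula $\mathrm{HHI} = \sum_i s_i^2$ with shares obtained by averaging the topic distributions of an author's articles; the two toy authors and the function names `author_topic_shares` and `hhi` are hypothetical.

```python
def author_topic_shares(article_thetas):
    """Average the topic distributions of an author's articles (the shares s_i)."""
    n_articles = len(article_thetas)
    n_topics = len(article_thetas[0])
    return [sum(theta[k] for theta in article_thetas) / n_articles
            for k in range(n_topics)]


def hhi(shares):
    """Herfindahl-Hirschman Index: sum of squared shares."""
    return sum(s * s for s in shares)


# Hypothetical authors with two articles each over three latent topics.
focused_author = [[0.90, 0.05, 0.05], [0.80, 0.10, 0.10]]   # both papers load on topic 1
diverse_author = [[0.90, 0.05, 0.05], [0.05, 0.05, 0.90]]   # papers load on different topics

hhi_focused = hhi(author_topic_shares(focused_author))
hhi_diverse = hhi(author_topic_shares(diverse_author))
```

With shares that sum to one over N topics, the index lies between 1/N (weight spread evenly) and 1 (all weight on one topic), so the focused author obtains the higher value in this toy example.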
Figure 16: Latent topics ranked by prevalence in the corpus with K = 15. (For each of the 15 topics, the figure lists the most probable word stems with their word prevalence (%), the topic proportion (%) and the female proportion (%); white = median female proportion.)

Figure 17: A topic with “labor”: topic 8 in the set with K = 15.

The reader may then wonder what additional information is contained in the unrestricted version of the Structural Topic Model (STM). One way to illustrate the importance of an adequate selection of the number of topics is to explore in detail the composition effects we already discussed above. We proceed as follows.
First, we consider the stem “labor”, and we look for it among the fifteen most frequent words within the restricted version of the STM, that is, the version with just 15 latent topics (K = 15). We find that word with the required frequency only within topic 8 in Figure 16. Figure 17 depicts the word cloud for topic 8 in the restricted version of the model with K = 15. Clearly, in this particular case, one may say that this cloud describes well the research field corresponding to JEL code J, that is, Labor and Demographic Economics.

The key idea with the Structural Topic Model is that a field like “Labor” can comprise many research lines in the unrestricted version of the model, in our case the one with 54 latent topics. When we look for the stem “labor” within the 54 latent topics, we find it among the fifteen most frequent words in as many as six topics. Figure 18 illustrates the most prevalent among these topics, which are Labor Search, Labor Supply, Human Capital, and Productivity Analysis. Notice, in particular, that there are important differences in the prevalence of females across these different subtopics, from 18 per cent in the more policy-oriented topic, “labor supply”, to 14 per cent in the more theoretical “labor search” (go back to Figure 7 for these shares). Important variability can be washed out when the methodology used accounts for the research field environment rather than for the research topic environment.

Figure 18: Word clouds for topics with the stem “labor” among the fifteen most frequent words in the set with K = 54. (a) Labor Search (14% fm). (b) Labor Supply-Policy (18% fm). (c) Human capital (14% fm). (d) Productivity (16% fm).

As we have anticipated, though, the reduction of the number of topics to K = 15 makes it easier to label the latent topics as meaningful research fields.
Following our previous analysis, Figure 19(a) plots the latent topics, showing the relative semantic distance between topics as well as their weight in terms of the fraction of documents/abstracts that they contain. If we compare Figure 8 (with K = 54) and Figure 19(a) (with K = 15), they have a similar “geography” in terms of general areas of knowledge. Therefore, similar patterns in terms of the distances between topics arise. For example, “Econometric Theory” seems to be isolated, whereas applied fields such as Labor and Public Economics are closely connected.

Figure 19(b) (like Figure 9 with K = 54) provides evidence of the “horizontal” differences between males and females in doing research. The results are in line with the previous literature, such as Dolado et al. (2012), Chari and Goldsmith-Pinkham (2017), Beneito et al. (2018) and Lundberg and Stearns (2019), which points out that females are unevenly distributed across fields. We concur with the previous literature that females are over-represented in Applied-Micro fields, especially Health-Gender, Experimental and Education, and under-represented in Econometric and Economic Theory fields, Macro-Monetary and Finance. For example, Dolado et al. (2012) use the classification of women by research areas (JEL 20 fields) in the top 50 economics departments in 2005. The proportions they find are very similar to ours: i) I-Health, Education and Welfare, 25%; ii) D-Microeconomics, 14%; iii) J-Labour and Demographic Economics, 15%; or iv) C2-Econometrics, 14.3%. In our analysis we find that the percentages of female authors are, for example: i) Health and Gender, 23%; ii) Decision Theory, 13.6%, and Game Theory, 11.4%; iii) Macroeconomics and Monetary, 14.2%; or iv) Econometrics, 14.4%. Having said that, the distribution of the proportion of females across these restricted topics seems to be slightly less dispersed than those identified in the previous literature with other sources of data.
This can be due to the fact that our methodology is more “continuous” than allocating females to fixed categories, inasmuch as the probabilistic model allocates females’ articles to latent topics with statistical weights.

Figure 19: Connectedness for K = 15. (a) Connectedness between topics and the fraction of documents/abstracts in each topic ($\theta_d$ distribution). (b) Connectedness between topics and the female authors of documents/abstracts in each topic.

Figure 20: Growth rates of prevalence and female proportion by topics.

Figure 20 jointly analyzes the evolution of the prevalence of the topics and the proportion of female authors. To build this figure, we have computed the growth rates of topic prevalence and of topic female proportions between the averages of the first seven years (2002-2008) and the last seven years (2013-2019) of the sample. First, we can observe that the proportion of females has increased in all topics but Finance (−6.6%). Regarding prevalence, only four topics have decreased their weight: Mechanism Design (−10.3%), Econometrics (−29%), Game Theory (−22.5%) and Experimental (−8.4%). On the one hand, the topics where the percentage of women authors has risen the most are Political Economy (+67.7%), Decision Theory (+42.5%), Macroeconomics and Monetary (+32.3%), Experimental (+40%) and Labor (+35%). In all of them women were clearly underrepresented. On the other hand, the topics where the percentage of women has grown the least, besides Finance, are Health and Gender (+11.4%), Econometrics (+9.4%), and IO (+9.2%).

Finally, there is no clear relationship between the growth rate of topic prevalence and the increase in female prevalence. This is surprising. We do not have data about the seniority of authors, but as the proportion of females is increasing, we can expect that the proportion of females among the new entrants in the T5 market should be relatively large.
New entrants should be more likely to work in “hot” topics than in declining ones. The combination of both effects should lead to a positive correlation between the increase in the prevalence of a topic and the increase in female representation, something that we do not observe clearly in the data. However, an alternative explanation for the increase in the proportion of women in some topics is that females who have already published in the top five in the past have extended their network of male coauthors and are getting more papers published.

6 Conclusions

Using unsupervised machine learning techniques and a new database composed of the abstracts of all articles published in T5 journals in Economics over the period 2002-2019, we have shown that there are persistent and significant horizontal differences in the way males and females approach research in Economics. Using the Structural Topic Model, we have identified latent topics for which the distribution of female authors is more uneven than with research fields. These findings are important for several reasons: i) T5 publications are key for research careers and also for determining the path of economic research; ii) the results are robust in the sense that they are automatically generated with a probabilistic model, without any deterministic allocation of papers to pre-established categories or fields of research; iii) finally, recent theoretical results by Conde-Ruiz et al. (2017, 2021) and Siniscalchi and Veronesi (2020) show that “horizontal” gender differences in the choice of research topic may lead to a gender discriminatory trap.

Beyond the scope of the present paper, we plan to extend our analysis in several directions. Firstly, we want to collect more information about the authors, in order to be able to capture dynamic effects. For instance, we want to differentiate between the research patterns of senior and junior authors.
We also want to investigate how males and females build their networks of coauthors and how this process determines the choice of latent topics. Secondly, we want to show the usefulness of the methodology and of the latent topics we have identified by reviewing research questions analyzed by the previous literature on academic gender gaps. For example, Hengel (2020) analyzes the differences in the quality of writing of papers. She shows that female-authored manuscripts are better written and concludes that females are subject to higher writing standards. The reason might be an unwelcome gendered culture through the entire editorial process at the time of deciphering complicated texts. We are currently applying Hengel’s readability scores methodology to the latent topics. Our preliminary findings suggest that papers belonging to topics with a higher prevalence of females are better written. Although this evidence can be interpreted as supporting the view that female-authored articles are better written than equivalent articles by men, it may also be the case that the results are driven by the particular topics. In other words, we need a deeper econometric analysis to disentangle whether the written quality of the papers is driven by the gender of the author or by the choice of the latent topics.

Likewise, Card et al. (2019) show that female-authored papers have more citations, suggesting that journals hold female-authored papers to higher standards. They obtain this result controlling for research field. We plan to collect data on citations and review this result, but controlling for latent topic. Finally, we also want to use algorithms (for example, LASSO, a widely used regression analysis machine learning method) to test whether the differences between gender research patterns are important enough to build a predictive model of gender given an observed abstract.

References

Bagues, Manuel and Pamela Campa, “Can Gender Quotas in Candidate Lists Empower Women? Evidence from a Regression Discontinuity Design,” 2017, (12149).

Bayer, Amanda and Cecilia E. Rouse, “Diversity in the Economics Profession: A New Attack on an Old Problem,” Journal of Economic Perspectives, Nov. 2016, 30 (4), 221–42.

Beneito, P., J. E. Boscá, J. Ferri, and M. García, “Women across Subfields in Economics: Relative Performance and Beliefs,” Fedea WP, June 2018, (2018-06).

Blei, David M., Andrew Y. Ng, and Michael I. Jordan, “Latent Dirichlet Allocation,” J. Mach. Learn. Res., March 2003, 3, 993–1022.

Boustan, Leah and Andrew Langan, “Variation in Women’s Success across PhD Programs in Economics,” Journal of Economic Perspectives, February 2019, 33 (1), 23–42.

Buckley, Chris, “Implementation of the SMART Information Retrieval System,” Technical Report, USA 1985.

Cabrales, A., M. García, and L. A. Puch, “Gendered Language in the British Press,” Mimeo COSME Gender, at 2018 Meetings of the Spanish Economic Association, 2018.

Card, David and Stefano DellaVigna, “Nine Facts about Top Journals in Economics,” Journal of Economic Literature, March 2013, 51 (1), 144–61.

Card, David, Stefano DellaVigna, Patricia Funk, and Nagore Iriberri, “Are Referees and Editors in Economics Gender Neutral?,” The Quarterly Journal of Economics, 11 2019, 135 (1), 269–327.

Chari, Anusha and Paul Goldsmith-Pinkham, “Gender Representation in Economics Across Topics and Time: Evidence from the NBER Summer Institute,” Working Paper 23953, National Bureau of Economic Research October 2017.

Chevalier, Judy, “The 2020 Report of the Committee on the Status of Women in the Economics Profession,” 2020.

Conde-Ruiz, J. Ignacio, Juan-José Ganuza, and Paola Profeta, “Statistical Discrimination and the Efficiency of Quotas,” Fedea Working Papers, 2017.

Conde-Ruiz, J. Ignacio, Juan José Ganuza, and Paola Profeta, “Statistical Discrimination and Committees,” Fedea Working Papers, February 2021, (2021-06).
Dolado, Juan, Florentino Felgueroso, and Miguel Almunia, “Are men and women-economists evenly distributed across research fields? Some new empirical evidence,” SERIEs: Journal of the Spanish Economic Association, September 2012, 3 (3), 367–393.

Hansen, Stephen, Michael McMahon, and Andrea Prat, “Transparency and Deliberation Within the FOMC: A Computational Linguistics Approach,” The Quarterly Journal of Economics, October 2017, 133 (2), 801–870.

Heckman, James J. and Sidharth Moktan, “Publishing and Promotion in Economics: The Tyranny of the Top Five,” Journal of Economic Literature, June 2020, 58 (2), 419–70.

Hengel, E., “Publishing while Female. Are women held to higher standards? Evidence from peer review,” Cambridge Working Papers in Economics 1753, Faculty of Economics, University of Cambridge, December 2020.

Hengel, Erin and Eunyoung Moon, “Gender and quality at top economics journals,” Working Papers 202001, University of Liverpool, Dept. of Economics, February 2020.

Lundberg, Shelly and Jenna Stearns, “Women in Economics: Stalled Progress,” Journal of Economic Perspectives, February 2019, 33 (1), 3–22.

Mimno, David, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum, “Optimizing Semantic Coherence in Topic Models,” 2011, pp. 262–272.

Roberts, Margaret E., Brandon M. Stewart, and Dustin Tingley, “stm: An R Package for Structural Topic Models,” Journal of Statistical Software, 2019, 91 (2), 1–40.

Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G. Rand, “Structural Topic Models for Open-Ended Survey Responses,” American Journal of Political Science, 2014, 58 (4), 1064–1082.

Siniscalchi, Marciano and Pietro Veronesi, “Self-image Bias and Lost Talent,” December 2020, (28308).

Tang, Cong, Keith Ross, Nitesh Saxena, and Ruichuan Chen, “What’s in a Name: A Study of Names, Gender Inference and Gender Behavior in Facebook,” in “Xu J., Yu G., Zhou S., Unland R. (eds) Database Systems for Advanced Applications, Lecture Notes in Computer Science, vol 6637,” Springer Berlin Heidelberg, 2011, pp. 344–356.

Appendix A The topic Model

We implement and develop the Structural Topic Model (STM) to incorporate document-level meta-data into a probabilistic text model. The topic model is said to be structural because “covariates” inform about structure (partial pooling of parameters). We keep track of journal names and publication years as covariates to estimate the prevalence of topics.

The starting point for understanding the STM probabilistic model is the LDA (Latent Dirichlet Allocation) generative model. According to LDA, the data generating process for document $d \in D$ assigns terms in vocabulary $V$ to the $N_d$ word positions of the document. The outcome is summarized in the document-term matrix, where the element $(d, v)$ is the number of times the $v$th unique word appears in the $d$th abstract. The algorithm follows the steps below:

1. Draw a $K$-dimensional Dirichlet vector $\theta_d$ containing the expected fraction of words in $d$ attributed to each topic $k = 1, \dots, K$.
2. For each word position $n$ in $d$, sample the indicator $z_{d,n}$ from $\text{Mult}_K(\theta_d, 1)$, which indicates the topic associated with position $n$.
3. Sample the word indicator $w_{d,n}$ from $\text{Mult}_V(B_{z_{d,n}}, 1)$, where the matrix $B$ collects the distributions $\beta_k$ over vocabulary $V$; $\beta_k$ is the frequency with which terms are generated from topic $k$.

STM, in turn, uses covariates to improve the estimation of the topics. Covariates affect (i) the proportion of a document $d$ devoted to a topic $k$ (topic prevalence, TP), and (ii) how much a word is used in topic $k$ (topical content, TC). To this purpose:

• for TP, the Dirichlet draws $\theta_d$ of document-level attention to each topic are replaced with a logistic-normal whose mean vector is parameterized as a function of document covariates.
• for TC, the distribution $\beta_k$ is proportional to a multinomial logistic regression parameterized as indicated below.
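The generative steps listed above, and the logistic-normal draw that STM substitutes for the Dirichlet in the topic prevalence model, can be illustrated with a short simulation. This is a minimal sketch under stated assumptions, not the estimation code used for the paper: the function names, the toy values of the Dirichlet parameter and of $B$, and the diagonal covariance in the logistic-normal draw are all illustrative choices.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw a Dirichlet vector by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def sample_logistic_normal(mean, sd, rng):
    """Draw topic proportions from a logistic-normal with K-1 free coordinates.

    Assumption for illustration: the covariance is diagonal (sd holds the K-1
    standard deviations). The K-th coordinate is fixed at 0 before the softmax.
    """
    eta = [rng.gauss(m, s) for m, s in zip(mean, sd)] + [0.0]
    top = max(eta)                      # subtract the max for numerical stability
    weights = [math.exp(e - top) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]


def generate_document(n_words, alpha, beta, rng):
    """Simulate one document under the LDA steps 1 to 3.

    alpha: Dirichlet parameter of length K (hypothetical values)
    beta:  K rows, each a probability distribution over the V vocabulary terms
    """
    theta = sample_dirichlet(alpha, rng)                # step 1: topic proportions
    topics = range(len(beta))
    vocab = range(len(beta[0]))
    assignments, words = [], []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]       # step 2: topic of position n
        w = rng.choices(vocab, weights=beta[z])[0]      # step 3: word given the topic
        assignments.append(z)
        words.append(w)
    return theta, assignments, words
```

For instance, `generate_document(20, [0.5, 0.5], [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]], random.Random(0))` simulates a 20-word document with two topics over a three-term vocabulary, and `sample_logistic_normal([0.3], [1.0], random.Random(0))` returns a two-topic proportion vector that sums to one.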
A (partially collapsed) variational expectation-maximization algorithm is implemented to approximate the posterior (inference). Posterior predictive checks [cf. Gelman et al., 1996] and tools for model selection as in Roberts et al. (2014) are then used.

Beyond the TP and TC functions of document metadata, the structural topic model can be summarized as follows:

1. Given parameters: (i) a variance-covariance matrix for topics $\Sigma$, (ii) a matrix of observed document-level covariates $X$ (journal names and years), and (iii) a vector $\gamma_k$ (of prevalence of each topic) for each covariate, with $\gamma_k \sim \mathcal{N}(0, \sigma_k^2 I_p)$, sample the vector of topic proportions in each document, $\theta_d$, that is,

$$\theta_d \sim \text{LogisticNormal}_{K-1}(\Gamma' x_d', \Sigma), \qquad \Gamma = [\gamma_1 | \dots | \gamma_K],$$

as a substitute for the Dirichlet conjugate prior. This constitutes the topic prevalence model.

2. The core language model, given the topic proportions per document $\theta_d$, consists of:
• sampling the topic assignment $z_{d,n}$ of each word position: $z_{d,n} \sim \text{MN}_K(\theta_d)$, with $K$ outcomes;
• conditional on the topic, choosing a word from $\beta_{z_{d,n}}$, that is, $w_{d,n} \sim \text{MN}_V(\beta_{z_{d,n}})$, where $B = [\beta_1 | \dots | \beta_K]$ is the matrix of distributions over vocabulary $V$.

3. The topical content model samples the topic-word distribution $\beta_{d,k,v}$. For now we do not use covariates to explain the topical content of documents.

Appendix B Details of the Data Pre-processing

Pre-processing of the abstracts that make up our database is essential in order to organize the words that form the texts in a homogeneous way. The main goal of this process is to reduce dimensionality by shrinking the set of words while retaining as much as possible of the information in the words used by the authors, which we do by selecting the terms with the most informational content. This helps us obtain a better estimation of more semantically meaningful topics. The first step is tokenization, which splits the texts into single words (unigrams) rather than bigrams, trigrams, paragraphs, etc.
Then we eliminate punctuation, and capital letters are converted to lowercase. This allows us to remove duplicates: for example, “Education” and “education” would be different words in our database if we did not convert all the words to lowercase. Once this is done we eliminate numbers and stopwords. By stopwords we refer to words without any informational content: common words such as “and”, “for”, “in”, etc. We remove the stopwords in the SMART list developed by Buckley (1985), a public list with more than 500 words. Additionally, we remove some custom stopwords that were very common in our database but not informationally relevant. These are: ‘download’, ‘slides’, ‘slide’, ‘jel’, ‘abstract’, ‘paper’, ‘author’, ‘literature’, ‘among’, ‘whether’, ‘authors’, ‘model’, ‘show’, ‘showed’, ‘shows’, ‘find’, ‘can’, ‘matter’, ‘model’, ‘models’, ‘may’, ‘effect’, ‘find’, ‘can’, ‘show’, ‘paper’, ‘also’, ‘provide’, ‘approach’, ‘thus’, ‘main’, ‘obtain’, ‘obtained’, ‘without’, ‘modelling’, ‘modeling’, ‘modeled’, ‘modelled’, ‘use’, ‘result’, ‘results’, ‘resulting’, ‘resulted’, ‘discuss’, ‘discussed’, ‘discussing’, ‘recent’, ‘recently’, ‘give’, ‘gives’, ‘given’, ‘review’, ‘reviewing’, ‘reviews’, ‘require’, ‘required’.

We end by stemming the tokens so as to retain only the roots of words in the same family, thereby unifying the information contained in related words. For example, “education”, “educative”, and “educated” are all related to education, so we just keep the root “educ” for all of them. The use of these stems eases dimensionality problems and groups the probabilities of a family of words into one. Our sample initially contained 13,835 different terms. After this process, without loss of generality, we reduce the number of unique terms to 4,241 in the corpus with which we build the document-term matrix.
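The pre-processing steps described in this appendix (tokenization into single words, lowercasing, removal of punctuation, numbers and stopwords, stemming, and construction of the document-term matrix) can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline actually used: the stopword set is a tiny stand-in for the SMART list plus custom terms, `toy_stem` is a crude suffix stripper standing in for a real stemmer, and all function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (assumption); the SMART list of Buckley (1985)
# and the custom stopwords listed in the text are not reproduced here.
STOPWORDS = {"and", "for", "in", "the", "of", "a", "we", "paper", "model", "show"}

# Crude suffix list standing in for a real stemming algorithm (assumption).
SUFFIXES = ("ational", "ation", "ative", "ated", "ing", "ed", "s")


def toy_stem(token):
    """Strip the first matching suffix, keeping a root of at least 3 letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, keep alphabetic unigrams only, drop stopwords, then stem."""
    # The pattern keeps runs of ASCII letters, so punctuation and numbers vanish.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]


def document_term_matrix(documents):
    """Return (vocabulary, matrix); entry (d, v) counts term v in document d."""
    counts = [Counter(preprocess(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix
```

With this toy stemmer, `preprocess("Education and educated workers in 2019: the educative model.")` maps the three education-related words to the common root `educ`, which mirrors the example in the text.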
Appendix C The optimal number of topics

Running the model involves a choice of hyperparameters, as discussed in Appendix A above, and one of those parameters is the number of latent topics in our corpus. As this choice could be seen as an arbitrary prior, we run some automatic tests in order to choose the optimal K without human intervention, so as to classify texts in the best possible way. This approach has the advantage of automatically selecting the number of topics that best fits the data. Arbitrarily choosing too few topics means clustering several topics into a single one. Choosing too many topics would tend to identify patterns in language rather than topics.

Figure A.1: Held-out likelihood estimation (horizontal axis: number of topics; vertical axis: held-out likelihood estimation)

We learn a lot about the different patterns in the data when choosing various alternatives for a fixed number of topics, as we will discuss below. However, our primary strategy for automatic selection focuses on the estimated held-out likelihood. Figure A.1 reports the log-likelihood of the model evaluated at the estimated parameters on the test set for each K between 15 and 100. The likelihood is maximized between 49 and 54 topics. Figure A.2, in turn, displays the number of iterations to convergence of the model, which drops sharply at 54 topics and remains at that number of iterations (except for a small spike at 60) beyond 62 topics.

Figure A.2: Number of iterations to convergence of the model (horizontal axis: number of topics; vertical axis: iterations to convergence)

Finally, Figure A.3 reports the semantic coherence, which is decreasing and becomes stable after 59 topics. Semantic coherence is maximized when the most frequent words in a given topic co-occur together (Mimno et al., 2011). High semantic coherence is reached when, in the end, there are fewer topics, each dominated by a few words.
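The co-occurrence idea behind semantic coherence can be made concrete with a small computation. The sketch below follows the definition in Mimno et al. (2011) as commonly stated: for the $M$ most probable words $v_1, \dots, v_M$ of a topic, the score sums $\log\big((D(v_m, v_l) + 1) / D(v_l)\big)$ over all pairs with $l < m$, where $D(\cdot)$ counts the documents containing the given word or words. The function name and the toy corpus are hypothetical, and this is not the routine of the stm package.

```python
import math


def semantic_coherence(top_words, documents):
    """Coherence of one topic from document co-occurrence counts.

    top_words: the topic's most probable words, ordered from most to least probable
    documents: list of tokenized documents (lists of stemmed terms)
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                raise ValueError("every top word must occur in at least one document")
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / denom)
    return score
```

On a toy corpus such as `[["tax", "incom"], ["tax", "incom", "wage"], ["auction", "bid"]]`, the pair `["tax", "incom"]` co-occurs in two documents and obtains a higher score than the pair `["tax", "bid"]`, which never co-occurs.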
On the other hand, average exclusivity is high when the words that are frequent in a topic are rarely used in the other topics. We follow Roberts et al. (2014) and use the FREX metric for this criterion. As shown in Figure A.4, there are two maxima, at 51 and 54 topics. With our data, we find it reasonable to assume that the result lies in the neighborhood of 52 topics given the held-out likelihood procedure and, given the additional tests, we select the highest number of topics in this neighborhood, namely 54 topics.

Figure A.3: Semantic Coherence (horizontal axis: number of topics; vertical axis: semantic coherence)

Figure A.4: Exclusivity (horizontal axis: number of topics; vertical axis: exclusivity)

Appendix D The topics profile

Given that we have chosen the number of latent topics automatically, it can be helpful to try to disentangle their nature. As an alternative to Figures 7 and 8, we use Simple Correspondence Analysis to measure the distance between topics. This is a descriptive technique to explore relationships among categorical variables. In our application we use the matrix of probabilities (the matrix $\theta_d$ obtained from STM) that each document belongs to each built-in topic in order to measure the distance between topics. The rows of this matrix are probabilities that add up to one. The clustering of rows measures the distance between topics (the columns of the matrix). This is the so-called chi-square distance:

$$\theta^{col}_{ij} = \sum_{a=1}^{r} (p_{ai} - p_{aj})^2,$$

where $r$ is the total number of rows, and the measure we compute and represent gives the Euclidean distance between columns $i$ and $j$ (col), accumulated over each and every row $a$ (abstract). Figure A.5(a) depicts the two largest coordinates of the distance matrix computed through Classical Multidimensional Scaling (MDS), so as to obtain the coordinates of the column categories. The coordinates are ordered from largest to smallest variance.
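The distance between topics defined above, and the first step of classical MDS, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names and the toy matrix are hypothetical, and only the double-centering step of classical MDS is shown; the plotted coordinates would then come from the leading eigenvectors of the double-centered matrix, which would normally be obtained with a linear algebra library.

```python
def column_sq_distances(theta):
    """Squared Euclidean distance between every pair of columns (topics).

    theta: list of rows (documents); each row holds the K topic proportions.
    """
    n_topics = len(theta[0])
    dist = [[0.0] * n_topics for _ in range(n_topics)]
    for i in range(n_topics):
        for j in range(i + 1, n_topics):
            d = sum((row[i] - row[j]) ** 2 for row in theta)
            dist[i][j] = dist[j][i] = d
    return dist


def double_center(sq_dist):
    """Return B = -1/2 * J D J with J = I - (1/n) 11', the classical MDS Gram matrix.

    Because sq_dist is symmetric, its row means equal its column means.
    """
    n = len(sq_dist)
    row_means = [sum(row) / n for row in sq_dist]
    grand_mean = sum(row_means) / n
    return [
        [-0.5 * (sq_dist[i][j] - row_means[i] - row_means[j] + grand_mean)
         for j in range(n)]
        for i in range(n)
    ]
```

For a toy matrix with two documents and three topics, `column_sq_distances([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])` gives the pairwise topic distances, and `double_center` turns them into the inner products of the centered topic columns.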
We find the corpus organized along two dimensions: Dimension 1 can be interpreted as going from Applied to Theory, whereas Dimension 2 goes from, say, Economics to Econometrics. We think this is apparent from casual inspection of Figure A.5(a), which involves square distances in [−4, +4]. Clearly though, outliers (understood as the topics far away from the origin) are very important in this representation. First, we identify outliers 21, 9 and 11, which we have associated with Econometric Theory in the fields of estimation (“estim”, “asymptot”, ... are the keywords in this case) and testing (“test”, “asymptot”, ...), together with structural econometrics (“identifi”, “instrument”, ...), respectively. These are actually among the top 10 most prevalent topics. Moreover, topics 9 and 11 are the 2nd and 3rd most prevalent. These outliers are located in the North East of the diagram in terms of the language they use.

The second set of outliers is located in the South East and is equally far from the center, while not isolated. These topics can be associated with Economic Theory texts. At the top of those we find topic 5, and then, not much further from the center, topics 6, 16 and 10. These are, respectively, auction theory (auction, bid, ...), together with game theory (game, player, ...) and information theory (belief, signal, ...), as well as mechanism design (mechan, implement, ...). These topics are relatively less prevalent in the sample than the Econometric Theory topics above, as we discussed in the main text.

Figure A.5: Largest coordinates of the distance matrix computed through Classical Multidimensional Scaling (MDS). (a) Whole Sample; (b) Zoom-in Sample.

Finally, there are some outliers at the North West corner of the diagram. We find here topics that seem to be mostly empirically oriented (applied) and, according to our representation, nearly as distant from Econometric as from Economic Theory.
These are, in particular, topics 19 and 49, which we have previously associated with Education and Gender issues, and in which the presence of female authors is relatively more prevalent. Finally, there is a negative correlation between the two coordinates, suggesting that distance values are larger than under the hypothesis of independence between these two key dimensions. This finding would require a treatment that goes beyond the scope of this paper. We leave further analysis of the nature of latent topics in leading economics journals for future research. The interested reader can check the center of the representation at square distances in [−1, +1] in Figure A.5(b).

Appendix E Analysis with the abstracts of the Papers and Proceedings (P&P) Papers

In this section, we extend our original sample with the Papers and Proceedings (P&P) articles published in the special May issue of the AER during the period 2011-2018 [14]. These P&P articles are very short (for example, they may be just an extension of a full article submitted to a different journal) and they are selected from the papers presented at the annual January meeting of the American Economic Association (AEA). Some of the papers are selected directly by the members of the AEA meetings committee, and others are chosen from external proposals for special sessions at the AEA meetings [15]. Interestingly for our analysis, papers in P&P are linked to the meeting sessions, and therefore they come in groups of 3 or 4 papers on a specific topic. Hence, the editorial process of P&P is very different from that of regular submissions, and the set of topics is likely to be more diverse, since some of the special sessions at the AEA meetings may be relevant for the current policy debate but not necessarily for research. For example, in the May 2020 issue we can find, among others, two sessions and the corresponding articles on “The economics of the health epidemics” or “Is United States deficit policy playing with fire?”.
With these additional P&P papers, our sample contains 6,428 abstracts/documents, which generate 253,312 tokens and 12,936 unique terms. The number of topics that best fits this extended sample is 70. The larger number of latent topics can be related to the larger number of unique words and documents, but also to the selection process of P&P described above: sessions unrelated to standard research, with a small number of (“seed”) papers closely related among themselves. As in the main text, we estimate these 70 latent topics using the STM algorithm. Figure A.6 presents the latent topics ranked by prevalence in the corpus with k = 70. Figure A.7 shows the STM output (the estimated latent topics) and also how the documents are allocated among them.

[14] Before 2011 the P&P articles did not have abstracts, and after 2018 the P&P articles are included in a different journal.
[15] For more information about the AEA Papers and Proceedings go to: https://www.aeaweb.org/journals/pandp/about-pandp

As in the main text, in Figure A.7 the size of the circle is proportional to the number of documents in the topic. The most salient feature of Figure A.7 is that, in addition to the larger number of topics, some of them are very small and could be related to the “seeds” described above: sessions of the AEA meetings with papers closely related among themselves but quite different from the research papers closest to them. Figure A.8 reinforces the evidence for the main message of this paper: male and female authors display different patterns when doing research. There is a subset of topics (South-East in Figure A.8) with a relatively high proportion of female authors, which moreover seem to be closely connected. In contrast, there is another set of topics (South-West in Figure A.8) that is also closely connected and where the presence of female authors is relatively scarce. Now, we want to look more closely at the content of some particular topics.
In this larger sample, it is easier to see that the latent topics go beyond standard research fields. In particular, Figure A.9 points to the latent topics with the highest proportions of female authors, topic 41 and topic 19. In that figure, the distributions over terms induced by each of these two topics are represented as word clouds, where the size of a term in the cloud is approximately proportional to its probability in the latent topic distribution $\beta_k$. Clearly, topic 41 is related to family economics and topic 19 to gender discrimination.

Figure A.6: Latent topics ranked by prevalence in the corpus with k = 70. Extended sample with P&P articles. (Panels: word prevalence (%), topic proportions (%) and female proportion by topic; white = median female proportion.)

Figure A.7: Connectedness between topics and the fraction of documents/abstracts in each topic ($\theta_d$ distribution). Extended sample with P&P articles.

Figure A.8: Connectedness between topics and the female authors' documents/abstracts in each topic. Extended sample with P&P articles.

Figure A.9: Topic Word Clouds in the extended sample with P&P articles. (a) Topic 41; (b) Topic 19.