How to choose your field? 

You should observe experienced scientists

How did you choose your scientific field? 

Was it a careful process? Could you give quantitative reasons at the time? If you're like me, chances are it was a bit of a random walk, influenced by what courses you liked, a specific inspiring high school teacher, or encouragement from friends and family. 

Is there a better way to find meaningful and important work?

In particular, what quantitative information does a young scientist need?

My hypothesis: Experienced scientists who switch fields are a strong signal for what is important science

The dataset: author publishing trends on ArXiv

Here, I measure author trends over 2.5 million papers published in the last three decades on the preprint server ArXiv (widely used by math, physics, computer science, etc.).

What was measured

Results: Overall trends

First, here is a summary figure showing the major trends in the dataset. 

Fig. 1. Authorship trends on ArXiv in each major field, and cumulative net transitions by existing authors between fields.

Comments on major trends

Results: Details within fields

We can also measure how people move between specific fields, to begin to answer questions like: 

Fig. 2, Fig. 3, and Fig. 4 show author switching trends for a few specific fields, including AI, math, and condensed matter physics (if you are interested in other fields, see Appendix 3).

AI-related fields


Fig. 2. Authorship trends for AI-related fields.

In summary:



Fig. 3. Authorship trends for math. 

Condensed matter physics


Fig. 4. Authorship trends for condensed matter physics.

Summary and reflections

We have looked at trends in scientific authorship over the last few decades, focusing on how existing authors move between fields, as a signal for value and importance of different fields. This has revealed trends in overall preference, as well as fine-grained trends for switching between pairs of fields, year by year. 


Appendix 1: Additional measurement details

Here I provide a longer summary of some of the analysis methods.

In this post, I measure author trends on the preprint server ArXiv (widely used by math, physics, computer science, etc.). The data comes from a Kaggle dataset of all 2.5 million published papers, including author names and the categories of their subject field. For all the data analysis and plots, and to experiment with the data yourself, see my GitHub.

On ArXiv, each paper is published in a certain subject category (or a few categories). This lets us track which fields a given author publishes in each year.
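As a rough illustration of this bookkeeping, here is a minimal sketch that builds an author-by-year table of categories. The field names (`authors_parsed`, `categories`, `update_date`) are my assumption about the Kaggle metadata layout, the sample records are made up, and `update_date` is only a stand-in for the publication year, so check all of these against the real data and the GitHub code.

```python
from collections import defaultdict

# Hypothetical sample records. The keys are assumed to match the Kaggle
# metadata; verify them against the actual dataset before relying on this.
records = [
    {"authors_parsed": [["Doe", "Jane", ""]],
     "categories": "hep-th gr-qc", "update_date": "2010-05-01"},
    {"authors_parsed": [["Doe", "Jane", ""], ["Roe", "Sam", ""]],
     "categories": "cs.LG", "update_date": "2018-11-20"},
]

def fields_by_author_year(records):
    """Map author name -> year -> set of categories published in that year."""
    table = defaultdict(lambda: defaultdict(set))
    for rec in records:
        year = int(rec["update_date"][:4])     # assumed proxy for the paper's year
        categories = rec["categories"].split()  # assumed space-separated string
        for last, first, *_ in rec["authors_parsed"]:
            name = f"{first} {last}".strip()
            table[name][year].update(categories)
    return table

author_fields = fields_by_author_year(records)
```

With the sample records above, `author_fields["Jane Doe"]` holds two physics categories for 2010 and `cs.LG` for 2018.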


How I combined published fields

Note: as part of the analysis, I collapse groups of related ArXiv categories (for example 'astro-ph.CO': 'Cosmology and Nongalactic Astrophysics' and 'astro-ph.EP': 'Earth and Planetary Astrophysics') into a single broader category, such as "astrophysics". This grouping is a choice I made.
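One simple way to do this kind of collapsing is to group on the part of the category before the dot. The sketch below assumes that approach; the `PREFIX_TO_FIELD` table and the group names are hypothetical and are not the actual grouping used for the figures.

```python
# Hypothetical grouping table: the mapping actually used in the analysis
# may differ, and some groups (e.g. "AI fields") cannot be built from a
# prefix alone.
PREFIX_TO_FIELD = {
    "astro-ph": "astrophysics",
    "cond-mat": "condensed matter physics",
    "math": "math",
}

def collapse_category(category):
    """Collapse an ArXiv category such as 'astro-ph.CO' into a broader field."""
    prefix = category.split(".", 1)[0]
    # Categories without an entry in the table are left unchanged.
    return PREFIX_TO_FIELD.get(prefix, category)
```

Here `collapse_category("astro-ph.CO")` and `collapse_category("astro-ph.EP")` both give `"astrophysics"`, while an unmapped category like `"cs.LG"` passes through untouched.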

Appendix 2: Caveats and pitfalls of this analysis

A few details to be aware of.

Fundamental assumption that switching = value

I am assuming that switching fields is a good indicator of value, but people may also switch for other reasons. 

ArXiv isn't complete (not representative of all scientific fields)

This is only ArXiv publishing data. Ideally I would want to include more databases across a broader range of fields (e.g. bioRxiv), and characterize the paper categories in a more general way.

Different fields have different publishing behaviors

I have tried to make the analysis robust to the publishing behaviors of different fields, but I may have missed something. In particular, I have tried to make sure that the moment an author genuinely switches is counted as a transition of exactly one author in that year.
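To make the "one author, one transition" idea concrete, here is a minimal sketch. It assumes a definition that is mine, not necessarily the one used for the figures: an author's primary field in a year is the field they published in most often that year, and a switch is a change of primary field between consecutive active years.

```python
from collections import Counter

def primary_field_by_year(papers):
    """papers: list of (year, field) pairs for one author.
    Returns {year: most common field that year}, in year order.
    Assumed definition; ties go to the field seen first in the input."""
    per_year = {}
    for year, field in papers:
        per_year.setdefault(year, Counter())[field] += 1
    return {year: per_year[year].most_common(1)[0][0] for year in sorted(per_year)}

def switches(papers):
    """Return one (year, old_field, new_field) event per change of primary field."""
    events = []
    previous = None
    for year, field in primary_field_by_year(papers).items():
        if previous is not None and field != previous:
            events.append((year, previous, field))
        previous = field
    return events

# Hypothetical author: mostly math in 2012, mostly AI from 2013 onward.
papers = [(2012, "math"), (2012, "math"), (2013, "math"),
          (2013, "AI"), (2013, "AI"), (2015, "AI")]
events = switches(papers)
```

Because the count is per author and per year, an author who publishes two papers or twenty in the new field still contributes a single transition.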

Author names are not a good way to identify unique authors

I use each author name as a unique string to identify an author, but this isn't correct. Some people share a name, and I will therefore count them as the same person. To reduce this problem, I exclude any author name with more than 100 publications. This should help, but it also distorts the total author count. In the end it's a tradeoff, and it would be better to use something like ORCID identifiers.
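The 100-publication cutoff can be sketched as follows. The function names and the input shape (one list of author names per paper) are assumptions for illustration; only the threshold comes from the text above.

```python
from collections import Counter

MAX_PAPERS = 100  # cutoff described above

def prolific_names(author_lists, max_papers=MAX_PAPERS):
    """Return the names that appear on more than max_papers papers."""
    counts = Counter(name for authors in author_lists for name in set(authors))
    return {name for name, n in counts.items() if n > max_papers}

def drop_prolific(author_lists, max_papers=MAX_PAPERS):
    """Remove over-threshold names from every paper's author list."""
    excluded = prolific_names(author_lists, max_papers)
    return [[a for a in authors if a not in excluded] for authors in author_lists]

# Hypothetical data: one very common name and one rare name.
author_lists = [["Y. Wang"]] * 101 + [["A. Smith", "Y. Wang"]]
cleaned = drop_prolific(author_lists)
```

In this toy input the common name is removed everywhere and only the rare name survives on the last paper, which also shows how the filter lowers the total author count.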

Using names, which people share, could give the illusion of people switching fields if a new person appears who is in a growing field, with the same name as an older person in another field. But the fact that I don't see transitions from astro to other places might mean it's fine.

I made specific choices on how to group scientific field designations

I made specific choices on which archive subfields to consider "AI fields" and which to classify together as quantum physics, for example. Changing this will change conclusions slightly. This is slightly more complicated than it first seems: for example, astrophysics changed its ArXiv organization at one point, so there is a massive flux between fields within ArXiv as people adjusted.

I measure net author switching rates

One might argue that the absolute switching rates are a more relevant metric. For example, if the field of AI is growing at rate X, but what's actually happening is people are transitioning into AI with rate 3X and out with rate 2X, that's probably worth knowing. However, it's not obvious how to measure this well (since it depends on the timing bin size you choose to measure transitions on). The net rate is nicely independent of the timing bin size. 
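The bin-size point can be checked on a toy example. The sketch below uses made-up author histories (one field per year) and counts flows into and out of "AI" when fields are sampled every 1, 2, or 4 years. It assumes every bin size covers the same first and last year, which is what makes the net count telescope to the same value.

```python
# Hypothetical histories: each author's field in five consecutive years.
histories = {
    "a": ["math", "AI", "math", "AI", "AI"],
    "b": ["AI", "AI", "physics", "physics", "physics"],
    "c": ["physics", "AI", "AI", "physics", "AI"],
}

def flows(histories, target, step):
    """Count authors entering and leaving `target` between samples
    taken every `step` years. Returns (inflow, outflow)."""
    inflow = outflow = 0
    for fields in histories.values():
        sampled = fields[::step]
        for before, after in zip(sampled, sampled[1:]):
            if before != target and after == target:
                inflow += 1
            elif before == target and after != target:
                outflow += 1
    return inflow, outflow

results = {step: flows(histories, "AI", step) for step in (1, 2, 4)}
# results == {1: (4, 3), 2: (2, 1), 4: (2, 1)}
# The gross flows change with the bin size, but inflow - outflow is 1 every time.
```

Back-and-forth moves inside a bin are invisible to coarser sampling, which is why the gross rates shrink while the net rate stays put.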

Appendix 3: Data by field

See also my GitHub project, to easily generate and play with these graphs, or alter the analysis in new ways.