readtext issue: ligature artifacts remain, even after specifying UTF-8 encoding. #146

MSU2580 · 2019-02-09T23:07:31Z

After reading in a corpus of scientific work in the following manner, we eventually get an object back out (FinalSciCorp) which contains all unique words within the corpus, along with some other information:

> tempsci <- readtext("*.txt", encoding = "UTF-8")
> sciCorp <- corpus(tempsci)
> doc_term_matrix <- dfm(sciCorp, remove = stopwords("english"), remove_punct = TRUE, remove_numbers = TRUE, stem = FALSE)
> FinalSciCorp <- textstat_frequency(doc_term_matrix)

However, FinalSciCorp still contains some words with ligatures such as "ff" and "fi", among others. As an example, FinalSciCorp contains both the words "field" and "<U+FB01>eld", or in another case just the word "signi<U+FB01>cant". The 'encoding' and 'stri_enc_detect' functions both indicate that the files are likely "UTF-8" although we have also tried many other options, including "latin1" for encoding.
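A quick way to see which features still carry ligature characters is to test them against the Unicode block that holds the Latin ligatures (U+FB00 to U+FB06). The sketch below is only illustrative: the `features` vector is a made-up stand-in for something like the `feature` column of `FinalSciCorp`, and it relies on the stringi package (already loaded in this session via a namespace).

```r
library(stringi)

# Hypothetical stand-in for the features reported by textstat_frequency();
# "\ufb01" is the single-character "fi" ligature, "\ufb00" the "ff" ligature.
features <- c("field", "\ufb01eld", "signi\ufb01cant", "e\ufb00ect")

# Latin ligatures live in U+FB00..U+FB06 (Alphabetic Presentation Forms).
ligature_pattern <- "[\\uFB00-\\uFB06]"

# TRUE for every feature that still contains a ligature character.
has_ligature <- stri_detect_regex(features, ligature_pattern)

# Escape the offending features so the code points are visible in any locale.
print(stri_escape_unicode(features[has_ligature]))
```

If this flags features even though the files were read as UTF-8, the ligatures are probably real characters in the source text and not a decoding error.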

kbenoit · 2019-02-10T22:55:53Z

Can you send me a link to one of the documents containing a ligature, as well as your sessionInfo() output, so I can test it?
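In the meantime, one possible workaround (a sketch under assumptions, not a confirmed readtext or quanteda fix) is to normalize the text before building the corpus. Unicode NFKC normalization maps compatibility ligatures such as U+FB01 to their plain letters. The helper name `normalize_ligatures` is hypothetical. The regex step that rejoins words split after a ligature (as PDF extraction sometimes produces) is a heuristic that can wrongly merge a word that genuinely ends in a ligature with the next word.

```r
library(stringi)

# Hypothetical helper, not part of readtext or quanteda.
normalize_ligatures <- function(x) {
  # Heuristic: drop a single space that directly follows a ligature and
  # precedes a lowercase letter, e.g. "e<U+FB00> ectiveness".
  x <- stri_replace_all_regex(x, "([\\uFB00-\\uFB06]) (?=\\p{Ll})", "$1")
  # NFKC compatibility normalization expands ligatures to plain letters.
  stri_trans_nfkc(x)
}

# Made-up sample text containing intact and split ligatures.
sample_text <- "signi\ufb01cant e\ufb00 ectiveness of the \ufb01 rst \ufb01eld"
cleaned <- normalize_ligatures(sample_text)
print(cleaned)
```

Assuming the object returned by `readtext()` keeps the document text in its `text` column, the helper could be applied as `tempsci$text <- normalize_ligatures(tempsci$text)` before calling `corpus()`.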

MSU2580 · 2019-02-11T04:24:24Z

Hello, I apologize if you prefer another format (let me know), but the following should contain the information you desire. Note that attached is one of the documents which contains ligature issues. The sessionInfo() data:

sessionInfo()

R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] readtext_0.71 quanteda_1.3.14

loaded via a namespace (and not attached):
[1] Rcpp_0.12.16 magrittr_1.5 stopwords_0.9.0 munsell_0.4.3 colorspace_1.3-2 lattice_0.20-35 R6_2.2.2 rlang_0.2.0
[9] fastmatch_1.1-0 stringr_1.3.0 httr_1.3.1 plyr_1.8.4 tools_3.4.4 grid_3.4.4 data.table_1.10.4-3 gtable_0.2.0
[17] spacyr_1.0 yaml_2.1.18 lazyeval_0.2.1 RcppParallel_4.4.2 tibble_1.4.2 Matrix_1.2-12 ggplot2_2.2.1 ISOcodes_2018.06.29
[25] stringi_1.1.7 compiler_3.4.4 pillar_1.2.1 scales_0.5.0 lubridate_1.7.4

The code I utilized (as it pertains to just the attached file):

library(quanteda)
library(readtext)
tempsci <- readtext("H0101.txt", encoding = "UTF-8")
sciCorp <- corpus(tempsci)
doc_term_matrix <- dfm(sciCorp, remove = stopwords("english"), remove_punct = TRUE, remove_numbers = TRUE, stem = FALSE)
FinalSciCorp <- textstat_frequency(doc_term_matrix)

Thank you for your time, it is most appreciated!

Best,

…

____________________________________________________ Keith L. Johnson, Ph. D. Montana State University Physics Department, Barnard Hall 226 PER Group

40-year trends in an index of survival for all cancers combined and survival adjusted for age and sex for each cancer in England and Wales, 1971–2011: a population-based study Summary Background Assessment of progress in cancer control at the population level is increasingly important. Population-based survival trends provide a key insight into the overall eﬀ ectiveness of the health system, alongside trends in incidence and mortality. For this purpose, we aimed to provide a unique measure of cancer survival. Methods In this observational study, we analysed trends in survival with population-based data for 7·2 million adults diagnosed with a ﬁrst, primary, invasive malignancy in England and Wales during 1971–2011 and followed up to the end of 2012. We constructed a survival index for all cancers combined using data from the National Cancer Registry and the Welsh Cancer Intelligence and Surveillance Unit. The index is designed to be independent of changes in the age distribution of patients with cancer and of changes in the proportion of lethal cancers in each sex. We analysed trends in the cancer survival index at 1, 5, and 10 years after diagnosis for the selected periods 1971–72, 1980–81, 1990–91, 2000–01, 2005–06, and 2010–11. We also estimated trends in age-sex-adjusted survival for each cancer. 
We deﬁne the diﬀerence in net survival between the oldest (75–99 years) and youngest (15–44 years) patients as the age gap in survival. We evaluated the absolute change (%) in the age gap since 1971. Findings The overall index of net survival increased substantially during the 40-year period 1971–2011, both in England and in Wales. For patients diagnosed in 1971–72, the index of net survival was 50% at 1 year after diagnosis. 40 years later, the same value of 50% was predicted at 10 years after diagnosis. The average 10% survival advantage for women persisted throughout this period. Predicted 10-year net survival adjusted for age and sex for patients diagnosed between 2010 and 2011 ranged from 1·1% for pancreatic cancer to 98·2% for testicular cancer. Net survival for the oldest patients (75–99 years) was persistently lower than for the youngest (15–44 years), even after adjustment for the much higher mortality from causes other than cancer in elderly people. Interpretation These ﬁndings support substantial increases in both short-term and long-term net survival from all cancers combined in both England and Wales. The net survival index provides a convenient, single number that summarises the overall patterns of cancer survival in any one population, in each calendar period, for young and old men and women and for a wide range of cancers with very disparate survival. The persistent sex diﬀerence is partly due to a more favourable cancer distribution in women than men. The very wide diﬀerences in survival for diﬀerent cancers, and the persistent age gap in survival, suggest the need for renewed eﬀ orts to improve cancer outcomes. Future monitoring of the cancer survival index will not be possible unless the current crisis of public concern about sharing of individual data for public health research can be resolved. 
Introduction
Cancer is an increasing public health concern, shown by substantial investments in human and ﬁ nancial resources for cancer management since the late 1990s. Health policy measures have focused on improvement of the organisation and delivery of services for prevention, diagnosis, and treatment. Research has provided the evidence base for these policies and is increasingly used to assess their eﬀ ect.1–7 The assessment of progress in cancer control has become crucial. Population-based cancer survival trends provide a key insight into the overall eﬀ ectiveness of the health system, alongside incidence and mortality.8 In this population-based survival study, we analysed cancer survival trends during the past four decades in England and Wales using two metrics: an index of survival for all cancers combined, and survival for each cancer, adjusted for age and sex. The all-cancers survival index was designed to provide one summary measure of cancer survival that can be monitored over time to show the overall progress in the eﬀ ectiveness of the health-care system. It was also designed to support assessment of the eﬀ ect of earlier diagnosis, which is a key component of the National Awareness and Early Diagnosis Initiative.9–11 Trends in survival for individual cancers will underline those cancer types for which there has been progress and those for which prognosis has remained poor.

Methods
Study design
Survival varies very widely with the age and sex of a patient with cancer and with the type of cancer. The frequency of diﬀ erent cancers is also changing over time: some cancers with poor prognosis, such as stomach and lung cancer, have become less common, whereas breast cancer in women, for which survival has been improving, has become more common. These trends can diﬀ er between the sexes: lung cancer has become much less common in men, but more common in women. 
The age proﬁ le of patients with cancer also changes over time, and these trends can diﬀ er between cancers. To enable valid assessment of survival trends for all cancers combined, the survival index must therefore take account of changes over time in the distribution of age, sex, and cancer type in all patients with cancer, especially over periods as long as 40 years. Similarly, trends in survival for each cancer must be adjusted for changes over time in the age (and sex) proﬁ le of patients with cancer. Data sources We examined survival trends in 7 176 795 adults (aged 15–99 years) diagnosed with a ﬁ rst, primary, invasive malignancy in England and Wales during 1971–2011, and followed up to Dec 31, 2012 (table 1). Data for England were obtained from the National Cancer Registry at the Oﬃ ce for National Statistics12 and for Wales from the Welsh Cancer Intelligence and Surveillance Unit. Patients diagnosed with a malignancy of the skin other than melanoma were excluded. Since 1971, the National Health Service Central Register has routinely updated these individual cancer records with information about each patient’s vital status (alive, emigrated, dead, or not traced). The vital status at Dec 31, 2012, was known for 98·4% of these patients. During the 41-year period, 4·3% of all cancer registrations were for the patient’s second-order or higher-order tumour: in the analyses for all cancers combined, the higher-order cancers were not included. 
Statistical analysis
The all-cancers survival index was constructed as a weighted average of the survival estimates for every combination of age group at diagnosis (15–44, 45–54, 55–64, 65–74, and 75–99 years), sex (male and female), and type of cancer (the 21 most common malignancies are shown in table 1 and all other malignant tumours are combined). The weights used were the proportion of patients with cancer diagnosed in England and Wales during 1996–99 in each of the 185 combinations of age group, sex, and type of cancer. We also constructed the all-cancers survival index separately for males and females and estimated survival adjusted for age and sex by cancer.

| Cancer | ICD-10 code* | England women, number | England women, % | England men, number | England men, % | Wales women, number | Wales women, % | Wales men, number | Wales men, % |
|---|---|---|---|---|---|---|---|---|---|
| Oesophagus | C15 | 67 474 | 2·0% | 106 793 | 3·1% | 4953 | 2·3% | 6857 | 3·1% |
| Stomach | C16 | 115 294 | 3·4% | 194 333 | 5·7% | 8627 | 4·0% | 14 299 | 6·5% |
| Colon | C18 | 292 352 | 8·7% | 271 220 | 8·0% | 17 711 | 8·3% | 17 736 | 8·1% |
| Rectum | C19–C21 | 143 610 | 4·3% | 204 363 | 6·0% | 9731 | 4·5% | 14 358 | 6·6% |
| Pancreas | C25 | 92 631 | 2·8% | 93 450 | 2·7% | 5868 | 2·7% | 6014 | 2·7% |
| Larynx (men) | C32 | ·· | ·· | 52 618 | 1·5% | ·· | ·· | 3529 | 1·6% |
| Lung | C33, C34 | 349 711 | 10·5% | 751 958 | 22·1% | 21 027 | 9·8% | 45 601 | 20·8% |
| Melanoma | C43 | 97 627 | 2·9% | 72 743 | 2·1% | 5429 | 2·5% | 4372 | 2·0% |
| Breast (women) | C50 | 1 039 609 | 31·1% | ·· | ·· | 65 370 | 30·6% | ·· | ·· |
| Cervix | C53 | 117 404 | 3·5% | ·· | ·· | 8272 | 3·9% | ·· | ·· |
| Uterus | C54, C55 | 160 539 | 4·8% | ·· | ·· | 10 836 | 5·1% | ·· | ·· |
| Ovary | C56, C57.0–7 | 172 400 | 5·2% | ·· | ·· | 11 051 | 5·2% | ·· | ·· |
| Prostate | C61 | ·· | ·· | 638 111 | 18·8% | ·· | ·· | 41 559 | 19·0% |
| Testis | C62 | ·· | ·· | 48 031 | 1·4% | ·· | ·· | 2743 | 1·3% |
| Kidney | C64–C66, C68 | 53 197 | 1·6% | 89 986 | 2·6% | 3431 | 1·6% | 5804 | 2·6% |
| Bladder | C67 | 90 204 | 2·7% | 239 621 | 7·0% | 5897 | 2·8% | 15 962 | 7·3% |
| Brain | C71 | 41 952 | 1·3% | 59 192 | 1·7% | 2832 | 1·3% | 3786 | 1·7% |
| Hodgkin’s disease | C81 | 19 114 | 0·6% | 26 714 | 0·8% | 1145 | 0·5% | 1675 | 0·8% |
| Non-Hodgkin lymphoma | C82–C85 | 99 752 | 3·0% | 114 269 | 3·4% | 5630 | 2·6% | 6320 | 2·9% |
| Myeloma | C90 | 43 446 | 1·3% | 48 136 | 1·4% | 2805 | 1·3% | 3041 | 1·4% |
| Leukaemia | C91–C95 | 70 760 | 2·1% | 92 917 | 2·7% | 4686 | 2·2% | 6112 | 2·8% |
| Other cancers† | ·· | 275 408 | 8·2% | 296 794 | 8·7% | 18 624 | 8·7% | 19 369 | 8·8% |
| Total | ·· | 3 342 484 | 100·0% | 3 401 249 | 100·0% | 213 925 | 100·0% | 219 137 | 100·0% |

*Tenth revision of the International Classiﬁ cation of Diseases (ICD): malignancies were initially coded according to the ICD revision in use during the year of diagnosis, ie, ICD 8 (1971–78), 9 (1979–95), or 10 (1996–). †Other cancers: all other malignant tumours are combined; they also include laryngeal cancer in women and breast cancer in men.

Table 1: Number of patients (aged 15–99 years) included in analyses in England and Wales diagnosed from 1971 to 2011 and followed up to 2012, by sex and type of malignancy

[Figure 1: Trends in the index of net survival for all cancers combined, for England and for Wales: all adults (15–99 years), men, and women, selected periods during 1971–2011. Panels: All adults, England; All adults, Wales; Men, England; Women, England; Men, Wales; Women, Wales. Legend: 1-year survival, 5-year survival, 10-year survival, Prediction. x-axis: Period of diagnosis (year).]

Net survival was used as the cancer survival measure for each component of the indexes. Net survival quantiﬁ es the survival after taking account of death from other causes (background mortality). All patients were allocated a deprivation category deﬁ ned according to their Lower Super Output Area (mean population about 1500) of residence at the time of cancer diagnosis. Life-tables were used to take account of the wide variation in background mortality by age, sex, deprivation, region, and over time. 
For this study, separate life-tables were created for England and Wales by single year of age, sex, deprivation category, and (in England) region of residence, for every calendar year between 1971 and 2012.13 National or regional life-tables were used for the 2·8% of patients diagnosed in England (2·6% in Wales) who could not be assigned to a speciﬁ c deprivation category or (in England) region; almost all of these patients were diagnosed in the 1970s (85% in England, 55% in Wales) or 1980s (14% England, 44% Wales). We used ﬂ exible multivariable parametric excess hazard models14,15 to estimate net survival up to 10 years after diagnosis for each nation, and for each stratum deﬁ ned by cancer, sex, age group, and calendar period. The models included age and year of diagnosis as main eﬀ ects, modelled on a continuous scale with restricted cubic splines, to account for potential non-linear excess (cancer-related) hazards. Interactions between age and year of diagnosis, year of diagnosis and follow-up time, and age and follow-up time were assessed to deal with potential variation of the excess hazard with time since diagnosis. The best-ﬁ tting models were chosen as those with the smallest Akaike Information Criterion.16 Net survival curves were estimated for each individual from these models according to their age and year of diagnosis. We obtained net survival estimates for each cancer and sex by averaging of individual net survival curves, over all ages and years of diagnosis within each age group and calendar period. In view of the fact that the models included the year of diagnosis as a continuous variable, we were able to predict survival up to 10 years after diagnosis, even for the patients diagnosed most recently (ie, 2010–11). 
All models were ﬁ tted with the STATA command stpm2 using STATA 13.1.17,18 We included all patients diagnosed during the 40 years from 1971 to 2011 in the models to estimate survival trends, but we report estimates for each cancer survival index at 1, 5, and 10 years after diagnosis only for six selected periods of diagnosis: 1971–72, 1980–81, 1990–91, 2000–01, 2005–06, and 2010–11. We deﬁ ne the diﬀ erence the oldest (75–99 years) and youngest (15–44 years) groups as the age gap in survival. We provide a simple summary of changes in survival by age as the absolute change (%) in the age gap since 1971. A negative value for this change means that the age gap has become wider. For Wales, reliable estimates of net survival could not be obtained for 11·5% of the age-sex-cancer combinations because in net survival between 1208 1971–72 1980–81 1990–91 2000–01 2005–06 2010–11 (prediction) 1 year 5 years 10 years 1 year 5 years 10 years 1 year 5 years 10 years 1 year 5 years 10 years 1 year 5 years 10 years 1 year 5 years 10 years 53·3% 54·1% 52·2% 10·6% 10·2% 11·0% 15·0% 14·7% 15·6% 15·4% 15·3% 15·5% 41·5% 42·6% 40·4% All cancers com bined 50·1% All patients Men 44·7% Women 55·5% Oeso phagus All patients Men Women Stomach All patients Men Women Colon All patients Men Women Rectum All patients Men Women Pancreas All patients Men Women Larynx Men Lung All patients Men Women Me lanoma of skin All patients Men Women Breast Women Cervix Women Uterus Women Ovary Women Prostate Men Testis Men Kidney All patients Men Women 81·6% 74·5% 86·7% 44·9% 45·4% 43·9% 43·7% 66·1% 81·9% 74·0% 75·6% 80·7% 16·0% 16·3% 15·4% 83·3% 29·8% 24·0% 25·2% 19·9% 27·9% 34·3% 55·8% 35·3% 50·6% 29·6% 61·0% 40·9% 28·8% 23·3% 34·1% 4·3% 4·0% 4·8% 5·2% 5·2% 5·3% 24·6% 25·3% 23·8% 24·2% 23·6% 25·0% 2·3% 2·4% 2·2% 3·5% 3·3% 3·9% 4·0% 4·0% 4·0% 22·8% 23·0% 22·6% 20·1% 19·1% 21·6% 1·2% 1·3% 1·1% 19·1% 18·5% 20·0% 20·6% 20·7% 20·5% 5·3% 4·8% 6·2% 8·2% 8·1% 8·4% 54·0% 34·2% 34·6% 55·2% 52·7% 33·8% 60·6% 61·4% 
59·5% 12·1% 12·4% 11·9% 32·5% 32·0% 33·2% 2·8% 3·1% 2·4% 4·3% 3·8% 5·0% 6·7% 6·7% 6·8% 31·8% 31·5% 32·1% 28·2% 27·1% 29·6% 1·5% 1·7% 1·2% 60·6% 55·7% 65·3% 24·2% 24·1% 24·3% 26·8% 27·0% 26·5% 62·1% 63·5% 60·7% 6·5% 6·1% 7·1% 10·9% 10·6% 11·5% 41·6% 41·9% 41·3% 67·8% 42·0% 68·7% 41·7% 66·6% 42·4% 13·0% 13·5% 12·5% 2·8% 3·2% 2·4% 41·0% 34·8% 47·2% 34·4% 28·0% 40·7% 64·9% 47·4% 41·6% 60·7% 42·0% 36·0% 47·0% 69·0% 52·7% 5·1% 4·8% 5·6% 8·9% 8·6% 9·4% 38·6% 38·1% 39·0% 37·7% 36·7% 39·0% 1·5% 1·7% 1·3% 31·1% 32·5% 28·8% 8·8% 9·1% 8·2% 33·9% 14·1% 34·7% 13·9% 32·4% 14·5% 7·0% 7·3% 6·5% 11·3% 11·0% 11·8% 47·5% 44·5% 66·7% 68·1% 47·6% 43·6% 65·4% 47·5% 45·4% 74·0% 51·2% 47·1% 74·8% 51·0% 46·4% 72·8% 51·4% 48·2% 14·7% 15·3% 14·0% 2·7% 3·0% 2·4% 1·2% 1·4% 1·1% 67·6% 63·7% 71·5% 36·4% 38·3% 33·4% 37·8% 39·3% 35·2% 70·3% 71·9% 68·6% 76·7% 77·5% 75·6% 17·4% 18·1% 16·7% 50·9% 45·8% 45·8% 41·0% 56·0% 50·5% 70·5% 54·3% 49·8% 66·7% 49·2% 45·7% 74·2% 59·2% 53·8% 11·5% 12·0% 10·8% 9·3% 9·4% 9·1% 42·0% 15·3% 12·4% 44·3% 15·6% 12·0% 38·6% 14·7% 13·1% 16·3% 13·1% 16·5% 13·0% 16·1% 13·1% 41·7% 18·8% 15·0% 43·8% 19·5% 15·3% 37·9% 17·7% 14·4% 52·6% 50·3% 52·9 49·4% 52·3% 51·1% 73·9% 58·2% 56·9% 76·1% 59·2% 56·5% 71·7% 57·3% 57·4% 55·5% 51·7% 55·4% 51·0% 55·7% 52·7% 79·2% 59·7% 56·1% 79·9% 59·6% 55·5% 78·1% 59·8% 57·0% 3·0% 3·2% 2·7% 1·2% 1·2% 1·2% 20·9% 21·7% 20·2% 3·3% 3·6% 3·1% 1·1% 1·1% 1·1% 60·2% 50·4% 81·7% 62·1% 52·6% 82·8% 64·1% 54·9% 83·7% 66·0% 57·0% 84·2% 67·0% 58·2% 84·7% 67·9% 59·2% 4·6% 4·8% 4·3% 3·1% 3·2% 2·9% 18·3% 18·6% 17·8% 5·5% 5·8% 5·0% 3·7% 3·9% 3·2% 20·5% 20·4% 20·7% 6·0% 6·1% 5·9% 3·8% 3·9% 3·7% 24·4% 23·9% 25·2% 6·9% 6·6% 7·4% 4·0% 3·7% 4·5% 52·3% 46·4% 34·9% 40·5% 61·1% 54·9% 88·7% 66·4% 60·4% 56·4% 49·8% 84·5% 91·8% 73·7% 68·3% 93·1% 77·2% 90·8% 69·8% 94·9% 82·6% 71·9% 63·4% 78·2% 95·5% 83·8% 79·7% 94·0% 78·4% 73·3% 96·6% 87·8% 84·5% 28·0% 27·0% 29·7% 96·4% 95·2% 97·3% 8·0% 7·4% 9·1% 4·4% 3·8% 5·4% 32·2% 30·5% 35·1% 9·6% 8·4% 11·6% 5·0% 4·0% 6·6% 87·0% 84·4% 82·6% 
79·3% 90·2% 88·3% 97·4% 90·4% 89·8% 96·6% 87·8% 86·8% 97·9% 92·4% 92·1% 52·7% 40·1% 85·9% 61·2% 48·4% 89·5% 71·1% 60·0% 92·7% 80·2% 71·6% 94·5% 83·9% 75·6% 96·0% 86·7% 78·5% 51·3% 46·0% 78·6% 58·3% 52·4% 81·6% 62·6% 57·2% 82·8% 65·4% 60·7% 82·6% 66·3% 61·9% 82·9% 67·5% 63·1% 59·0% 55·5% 79·5% 65·1% 61·5% 83·3% 69·5% 65·6% 86·9% 73·1% 69·7% 88·7% 75·9% 73·3% 90·3% 78·8% 77·4% 20·5% 17·9% 50·2% 24·9% 21·5% 57·0% 30·8% 26·4% 64·7% 38·4% 31·7% 68·8% 42·4% 33·5% 72·7% 46·4% 34·8% 36·9% 25·1% 71·5% 38·2% 24·4% 79·6% 49·6% 34·1% 89·5% 73·8% 62·4% 92·4% 81·4% 75·1% 94·0% 84·8% 83·6% 70·5% 69·2% 91·2% 84·0% 83·3% 95·8% 92·3% 91·9% 98·0% 96·3% 96·2% 98·7% 97·5% 97·4% 99·1% 98·3% 98·2% 28·5% 28·9% 28·0% 23·0% 23·0% 23·1% 51·3% 52·6% 49·1% 34·1% 35·3% 32·2% 27·6% 28·5% 26·1% 57·1% 58·7% 54·4% 39·4% 40·8% 37·1% 32·3% 33·4% 30·5% 62·8% 44·8% 37·9% 63·9% 45·2% 37·8% 60·9% 44·0% 38·0% 67·2% 68·0% 65·9% 49·8% 43·0% 50·0% 42·9% 49·4% 43·2% 72·5% 56·3% 49·6% 73·2% 56·7% 50·0% 71·3% 55·6% 48·9% (Table 2 continues on next page) 1209 1971–72 1980–81 1990–91 2000–01 2005–06 2010–11 (prediction) 1 year 5 years 10 years 1 year 5 years 10 years 1 year 5 years 10 years 1 year 5 years 10 years 1 year 5 years 10 years 1 year 5 years 10 years 7·2% 6·6% 7·9% 56·5% 54·2% 59·4% 29·9% 29·3% 30·6% 17·7% 17·6% 17·9% 75·6% 73·9% 77·8% 39·3% 60·2% 62·8% 40·9% 53·4% 35·2% (Continued from previous page) Bladder All patients Men Women Brain All patients Men Women Hodgkin’s disease All patients Men Women Non-Hodgkin lym phoma All patients Men Women Multiple myeloma All patients Men Women Leu kaemia All patients Men Women Other cancers* All patients Men Women 49·5% 49·4% 49·6% 11·8% 12·1% 11·4% 13·1% 13·1% 13·0% 37·4% 36·8% 38·0% 34·2% 35·4% 32·5% 55·3% 57·3% 53·0% 32·4% 33·7% 29·0% 5·4% 5·0% 6·0% 47·7% 45·2% 51·0% 22·0% 21·7% 22·3% 6·2% 6·8% 5·5% 6·9% 6·6% 7·2% 73·4% 56·0% 48·0% 76·0% 57·9% 49·3% 66·6% 50·8% 44·7% 52·8% 77·2% 60·8% 80·1% 54·2% 63·0% 69·6% 54·9% 49·0% 74·7% 56·4% 49·5% 78·5% 59·2% 52·0% 
64·7% 49·1% 43·0% 23·3% 23·3% 23·3% 9·8% 9·2% 10·6% 7·2% 6·7% 7·8% 82·7% 66·8% 58·8% 65·1% 82·2% 56·5% 61·8% 83·3% 69·2% 27·7% 27·9% 27·4% 87·6% 87·5% 87·7% 11·8% 11·2% 12·7% 8·4% 7·9% 9·2% 30·4% 12·7% 30·9% 12·1% 29·8% 13·7% 8·8% 8·3% 9·5% 75·1% 69·2% 74·6% 68·7% 75·8% 69·9% 90·0% 80·3% 75·8% 89·7% 80·4% 75·8% 75·8% 90·3% 80·2% 58·8% 37·5% 58·6% 36·8% 59·0% 38·4% 28·1% 27·6% 28·8% 65·8% 44·9% 65·7% 44·2% 45·8% 66·0% 48·4% 47·8% 49·0% 47·3% 48·6% 45·6% 54·7% 54·3% 55·2% 17·2% 17·2% 17·1% 23·6% 23·7% 23·5% 36·5% 35·2% 37·9% 8·6% 9·0% 8·1% 14·9% 14·4% 15·6% 32·0% 30·7% 33·4% 57·4% 57·4% 57·3% 22·0% 22·2% 21·8% 57·8% 34·0% 59·4% 34·4% 55·8% 33·6% 35·2% 54·5% 52·6% 31·9% 56·6% 39·0% 35·2% 34·5% 35·9% 10·8% 11·1% 10·3% 24·0% 23·6% 24·6% 30·2% 26·9% 33·9% 70·1% 52·3% 43·9% 70·0% 51·6% 43·4% 53·2% 44·6% 70·2% 27·7% 64·5% 14·3% 65·7% 28·8% 15·1% 63·2% 26·4% 13·4% 63·8% 41·6% 32·3% 65·6% 42·4% 32·3% 61·4% 40·5% 32·2% 56·6% 37·1% 32·5% 55·0% 33·7% 29·2% 58·4% 41·0% 36·3% 73·5% 77·6% 63·0% 34·7% 35·3% 33·9% 90·8% 90·3% 91·4% 74·3% 74·4% 74·3% 70·6% 71·8% 69·3% 66·3% 68·3% 63·7% 59·7% 58·7% 60·9% 54·8% 49·2% 57·8% 52·4% 47·0% 40·9% 72·4% 53·4% 49·5% 76·6% 56·5% 53·5% 61·4% 45·3% 39·1% 15·0% 10·6% 14·2% 9·9% 16·1% 11·5% 40·1% 18·5% 13·5% 41·1% 17·8% 12·8% 38·8% 19·5% 14·5% 82·9% 78·3% 82·5% 77·2% 83·4% 79·7% 91·4% 85·0% 80·0% 90·8% 84·1% 77·7% 92·3% 86·3% 83·1% 59·7% 52·6% 59·1% 51·9% 60·5% 53·3% 79·6% 68·8% 63·1% 79·8% 68·1% 62·2% 79·4% 69·5% 64·1% 36·0% 21·4% 37·9% 23·5% 34·0% 19·2% 76·7% 47·0% 32·6% 78·0% 50·0% 36·8% 75·3% 43·8% 27·9% 46·4% 38·7% 47·7% 39·4% 44·6% 37·8% 68·6% 51·5% 46·1% 70·7% 53·3% 47·6% 65·9% 49·1% 44·2% 40·6% 36·6% 37·8% 33·9% 43·9% 39·7% 63·5% 45·2% 41·9% 63·1% 43·3% 40·1% 63·9% 47·5% 44·0% 38·4% 34·8% 40·4% 36·9% 36·2% 32·5% *Other cancers: all other malignant tumours are combined; they also include laryngeal cancer in women and breast cancer in men. 
Table 2: 40-year trends in the index of net survival for all cancers combined at 1, 5, and 10 years after diagnosis in adults (15–99 years) in England from 1971 to 2011 and trends in the age-adjusted net survival for 21 selected cancers in England from 1971 to 2011 by sex

of the small number of patients, and broader age groups were constructed to re-estimate survival for those combinations.

Role of the funding source
The funder had no role in study design, quality control, analysis, interpretation of the results, drafting, or the decision to submit for publication. The corresponding author had full access to all data and was responsible for the decision to publish.

Results
The index of net survival for all cancers combined at 1, 5, and 10 years since diagnosis increased substantially between 1971 and 2011 in England and Wales (ﬁ gure 1, tables 2 and 3). The all-cancers survival index was 50% at 1 year after diagnosis for patients diagnosed in 1971–72. For patients diagnosed during 2005–06, the index was 50% at 5 years after diagnosis, and for patients diagnosed during 2010–11, we predict that the all-cancers survival index will reach 50% at 10 years after diagnosis. For patients diagnosed during 2010–11, the survival index for all cancers combined had reached 69–70% at 1 year and a predicted value of 54% at 5 years for both sexes combined. The 5-year survival index rose by 24% (from 30% to 54%) and the 10-year survival index by 26% (from 24% to 50%) between the periods 1971–72 and 2010–11. Most of the increase occurred between 1990 and 2011. The survival index for all cancers combined is on average 10% higher for women than for men at each time interval since diagnosis. The pattern of increase in the index was fairly similar for both men and women during the whole period, although the increase was linear for women but it became steeper for men after 1990–91. 
For patients diagnosed during 2010–11, the all- cancers survival index for women in England was 74% at 1 year, 59% at 5 years, and 54% at 10 years, whereas the ﬁ gures for men were 67% at 1 year, 49% at 5 years, and 1210 1971–72 1980–81 1990–91 2000–01 2005–06 2010–11 (prediction) 1 year 5 years 10 years 1 year 5 years 10 years 1 year 5 years 10 years 1 year 5 years 10 years 1 year 5 years 10 years 1 year 5 years 10 years 48·1% 28·9% 23·4% 42·9% 24·8% 20·4% 53·2% 32·9% 26·3% 53·6% 34·7% 48·2% 28·8% 58·9% 40·5% 28·9% 23·7% 34·1% 58·4% 40·4% 34·6% 53·2% 33·9% 28·1% 63·4% 46·8% 41·1% 63·2% 46·9% 41·6% 59·1% 41·5% 35·9% 67·2% 52·2% 47·2% 66·3% 50·6% 46·0% 62·7% 45·7% 41·0% 69·9% 55·5% 51·0% 69·4% 54·2% 65·9% 49·2% 72·8% 59·0% 50·2% 45·5% 54·8% 9·5% 8·7% 10·8% 15·2% 15·3% 15·0% 16·9% 17·9% 15·2% All cancers combined All patients Men Women Oesophagus All patients Men Women Stomach All patients Men Women Colon All patients Men Women Rectum All patients Men Women Pancreas All patients Men Women Larynx Men Lung All patients Men Women Melanoma of skin All patients Men Women Breast Women Cervix Women Uterus Women Ovary Women Prostate Men Testis Men Kidney All patients Men Women 15·6% 14·6% 17·4% 5·2% 5·1% 5·4% 5·7% 5·6% 6·0% 4·1% 3·8% 4·8% 4·6% 4·5% 4·9% 42·7% 25·0% 22·8% 43·1% 26·5% 24·5% 42·2% 23·4% 21·2% 50·8% 22·9% 19·7% 50·6% 21·4% 17·9% 51·0% 25·2% 22·1% 18·7% 19·1% 18·1% 6·0% 5·8% 6·4% 10·1% 21·3% 9·7% 21·0% 21·9% 10·8% 51·8% 33·3% 51·9% 33·3% 51·8% 33·3% 58·5% 31·2% 58·7% 29·9% 58·1% 33·1% 12·2% 11·5% 12·9% 3·8% 4·0% 3·7% 2·4% 2·7% 2·1% 12·8% 13·0% 12·5% 4·6% 5·6% 3·7% 5·2% 4·9% 5·6% 8·9% 8·6% 9·4% 30·9% 30·9% 31·0% 27·7% 26·1% 30·0% 3·4% 4·6% 2·3% 22·8% 23·2% 22·1% 6·9% 7·0% 6·8% 24·7% 10·8% 25·0% 10·6% 11·2% 24·1% 5·8% 6·0% 5·5% 9·2% 9·1% 9·3% 30·7% 32·8% 27·4% 8·8% 9·3% 7·9% 6·7% 7·1% 6·1% 35·5% 10·6% 37·7% 10·9% 32·1% 10·3% 7·9% 7·8% 8·0% 39·7% 42·3% 35·8% 12·9% 12·7% 13·3% 30·9% 12·6% 32·3% 12·6% 12·7% 28·5% 9·9% 9·9% 10·1% 36·5% 15·5% 12·0% 38·2% 15·5% 11·7% 
33·5% 15·6% 12·4% 19·5% 19·4% 14·9% 43·1% 14·4% 45·0% 39·6% 19·6% 16·0% 58·4% 39·8% 37·2% 60·0% 40·3% 37·4% 56·9% 39·2% 37·0% 63·2% 45·2% 42·4% 65·8% 46·6% 43·3% 60·5% 43·7% 41·6% 67·8% 50·9% 48·3% 70·2% 51·8% 48·5% 65·4% 49·9% 48·0% 73·0% 74·9% 71·1% 57·7% 55·4% 57·9% 54·9% 57·5% 55·8% 65·7% 40·6% 37·1% 66·5% 39·8% 35·9% 64·6% 41·7% 38·9% 72·4% 50·0% 46·7% 73·2% 49·5% 45·8% 71·4% 50·8% 48·0% 75·2% 54·4% 51·3% 76·1% 54·1% 50·6% 74·0% 54·8% 52·3% 55·6% 77·7% 58·5% 78·6% 58·4% 55·1% 76·4% 58·6% 56·4% 12·9% 13·5% 12·4% 4·2% 5·0% 3·3% 2·8% 3·7% 2·0% 14·0% 14·8% 13·3% 3·0% 3·4% 2·6% 1·5% 1·8% 1·3% 16·3% 16·7% 15·8% 3·0% 3·4% 2·7% 1·3% 1·5% 1·2% 19·0% 19·4% 18·6% 3·3% 3·7% 2·9% 1·2% 1·4% 1·1% 77·7% 56·3% 45·9% 82·5% 64·8% 55·6% 82·1% 63·9% 54·5% 80·2% 60·4% 50·4% 81·4% 63·3% 53·7% 84·0% 68·1% 59·5% 5·1% 4·2% 6·6% 3·6% 2·8% 5·1% 18·7% 18·6% 18·8% 7·2% 7·2% 7·0% 5·5% 5·6% 5·3% 19·7% 19·5% 20·1% 6·8% 6·7% 6·9% 4·7% 4·6% 4·9% 21·5% 21·1% 22·2% 5·9% 5·5% 6·6% 3·3% 2·9% 4·0% 25·5% 24·4% 27·4% 6·9% 6·3% 8·0% 3·6% 3·1% 4·3% 31·1% 28·8% 35·2% 8·6% 7·7% 10·3% 4·2% 3·7% 5·1% 79·9% 51·1% 44·0% 73·8% 38·9% 33·3% 84·4% 60·1% 52·0% 74·9% 47·9% 34·8% 73·9% 52·8% 47·4% 72·7% 55·9% 53·4% 48·2% 22·2% 18·0% 62·7% 36·6% 27·8% 82·9% 69·5% 66·2% 43·7% 29·0% 24·4% 44·8% 30·6% 25·3% 41·9% 26·4% 22·9% 82·3% 63·1% 76·6% 51·0% 86·5% 57·2% 44·6% 72·0% 66·6% 85·6% 71·4% 66·3% 81·8% 62·5% 55·9% 88·3% 78·0% 73·9% 77·5% 91·3% 72·9% 89·4% 71·0% 65·8% 92·7% 82·2% 78·1% 94·4% 82·4% 77·6% 93·1% 76·4% 68·9% 95·3% 86·7% 84·1% 96·8% 89·0% 82·1% 95·8% 83·7% 68·3% 97·6% 92·9% 92·2% 81·8% 60·3% 48·5% 87·4% 71·7% 62·3% 91·4% 80·4% 73·4% 93·0% 83·8% 77·9% 94·3% 86·7% 81·8% 80·0% 63·2% 57·8% 78·6% 59·9% 55·0% 78·5% 59·9% 55·2% 79·7% 62·4% 57·5% 81·7% 65·6% 60·3% 76·2% 61·7% 56·8% 80·6% 67·0% 62·2% 85·3% 72·4% 69·6% 88·1% 76·8% 73·9% 90·5% 81·2% 77·8% 52·0% 26·2% 21·8% 56·9% 31·4% 26·6% 61·1% 36·6% 31·8% 63·1% 39·2% 34·4% 65·1% 41·9% 37·1% 65·9% 35·9% 25·6% 72·9% 44·6% 32·9% 85·0% 68·8% 59·1% 90·1% 79·8% 74·9% 93·7% 
87·1% 87·1% 89·9% 81·1% 80·0% 94·4% 89·7% 89·1% 97·1% 95·0% 94·1% 97·4% 96·0% 94·4% 97·4% 96·6% 93·9% 46·9% 31·0% 48·5% 32·0% 44·2% 29·5% 25·1% 25·6% 24·4% 53·0% 36·5% 29·7% 54·6% 37·2% 30·0% 50·3% 35·4% 29·2% 61·6% 46·2% 39·5% 62·3% 46·9% 39·9% 60·5% 45·1% 38·7% 66·6% 51·2% 44·0% 67·6% 51·4% 43·5% 64·8% 50·8% 44·8% 70·8% 72·2% 68·5% 55·2% 47·3% 53·9% 44·2% 57·3% 52·4% (Table 3 continues on next page) 1211 1971–72 1980–81 1990–91 2000–01 2005–06 2010–11 (prediction) 1 year 5 years 10 years 1 year 5 years 10 years 1 year 5 years 10 years 1 year 5 years 10 years 1 year 5 years 10 years 1 year 5 years 10 years *Other cancers: all other malignant tumours are combined; they also include laryngeal cancer in women and breast cancer in men Table 3: 40-year trends in the index of net survival for all cancers combined at 1, 5, and 10 years after diagnosis in adults (15–99 years) in Wales from 1971 to 2011 and trends in the age-adjusted net survival for 21 selected cancers in Wales from 1971 to 2011 by sex 46% at 10 years. Both the levels and the trends in the all- cancers survival index were similar in England and Wales. The average absolute diﬀ erence between the two countries was less than 1% (ﬁ gure 1, tables 2 and 3). Survival for both sexes combined varied widely for diﬀ erent cancers, with the most recent predicted 10-year net survival adjusted for age and sex ranging from only 1·1% for pancreatic cancer to 98·2% for testicular cancer. A scatter-plot of the 1-year, 5-year, and 10-year survival estimates for adults diagnosed in 2010–11 against the absolute change since 1971 enables three broad clusters of cancers to be identiﬁ ed (ﬁ gure 2). The ﬁ rst cluster consists of cancers with high survival in 2010–11 for which the absolute increase in survival since 1971–72 is progressively larger for survival at 1, 5, and 10 years. It includes cancers of the breast, prostate, testis, and uterus, and melanoma and Hodgkin’s disease. 
The second cluster is of cancers with a moderate level of survival (64–84%) in 2010–11 and, generally, smaller increases since 1971–72. This cluster consists of cancers of the larynx, cervix, rectum, colon, bladder, ovary, and kidney, with non-Hodgkin lymphoma, multiple myeloma, and leukaemia. For multiple myeloma and leukaemia, age-adjusted 10-year survival rose by more than 22% between the periods 1990–91 and 2010–11, from around 10·8% to a predicted 32·6% for multiple myeloma and from 24·0% to 46·1% for leukaemia (table 2). The third cluster is of cancers for which survival for patients diagnosed during 2010–11 is still low, and for which little or no improvement has occurred in the past 40 years: this group consists of malignancies of the brain, stomach, lung, oesophagus, and pancreas. This clustering can be seen as early as 1 year after diagnosis, and each cancer is in the same cluster, irrespective of the time since diagnosis (and the nation). We observed the largest absolute change in the age- adjusted survival for multiple myeloma, leukaemia, and prostate cancer. 
1212 Testis Breast Uterus Melanoma Hodgkin’s Prostate Larynx Cervix Bladder Rectum NHL Colon Kidney Myeloma Leukaemia Ovary Other cancers Brain Stomach Oesophagus Lung Testis Hodgkin’s Breast Uterus Melanoma Prostate Larynx Cervix NHL Colon Rectum Kidney Bladder Leukaemia Myeloma Other cancers Ovary Brain Stomach Oesophagus Lung Pancreas Testis Melanoma Hodgkin’s Uterus Breast Prostate Larynx Cervix Bladder Colon Kidney NHL Rectum Leukaemia Other cancers Myeloma Ovary Brain Stomach Oesophagus Lung Pancreas Testis Melanoma Uterus Hodgkin’s Breast Prostate Testis Prostate Melanoma Hodgkin’s Breast Uterus Larynx Cervix Colon Kidney NHL Rectum Bladder Other cancers Leukaemia Ovary Myeloma Cervix Larynx Bladder Colon NHL Rectum Kidney Leukaemia Other cancers Ovary Myeloma Brain Stomach Oesophagus Lung Pancreas Brain Stomach Oesophagus Lung Pancreas –5 0 5 15 25 35 45 55 65 –5 0 5 Absolute change since 1971 (%) 15 25 35 45 Absolute change since 1971 (%) 55 65 1213 England 100 Wales Cluster 1 Testis Breast Melanoma Hodgkin’s Uterus Larynx Cervix Prostate Rectum NHL Myeloma Colon Ovary Leukaemia Bladder Kidney Other cancers Cluster 2 Brain Oesophagus Stomach Lung Pancreas Cluster 2 Pancreas lung cancer has 1-year survival from improved substantially, from 16% in 1971–72 to 32% in 2010–11. However, estimated long-term survival for patients diagnosed in 2010–11 is very poor for both sexes: as low as 10% at 5 years and 4% and 7% in men and women, respectively, at 10 years. This overall pattern of no improvement in long-term survival is common in the cluster of poor-prognosis cancers (oesophagus, stomach, pancreas, and brain), for men and women and for both England and Wales. Survival for breast cancer has seen a rapid and substantial improvement during the past 40 years. 5-year survival increased from 53% in 1971–72 to a predicted value of 87% in 2010–11. After 10 years, survival rose from 40% in 1971–72 to a predicted 78% for patients diagnosed during 2010–11. 
Diﬀ erences between 5-year and 10-year survival estimates remained broadly constant since 1971, showing that most of the improvements in long-term survival arose in the ﬁ rst 5 years after diagnosis. Breast cancer accounted for nearly a third of all cancers in women, which partly explains the higher all-cancers survival index in women than in men. Although survival from cancers of the colon and rectum is much lower than survival from breast cancer (around 20% lower in 2010–11), the trends in 1-year, 5-year, and 10-year survival for these two cancers have followed an almost identical pattern to that of breast cancer during the past 40 years. For men diagnosed with prostate cancer during 2010–11, the predicted values for 5-year and 10-year estimates are almost identical at 85% and 84%, respectively, which are huge increases from the values of 37% and 25% for men diagnosed 40 years ago. The trends are quite distinct for short-term, medium-term, and long-term survival. In both England and Wales, 1-year survival has been increasing since 1971–72, whereas acceleration in 5-year survival started for men diagnosed in the 1980s; 10-year survival only began increasing for men diagnosed in the 1990s. For women diagnosed with cancer of the ovary during 2010–11, the age-adjusted survival was predicted as 46% at 5 years and 35% at 10 years compared with 20% and 18%, respectively, for women diagnosed during 1971–72. These results suggest that the underlying increase in survival of up to 5 years is likely to continue. Net survival is generally lower for the oldest patients (75–99 years) than the youngest (15–44 years), even though net survival accounts for a higher mortality from causes other than cancer in elderly patients. 
This ﬁ nding is shown by a scatter-plot of the age gap in net survival at 1, 5, and 10 years after diagnosis for adults diagnosed in Figure 2: Net survival adjusted for age and sex for each cancer in 2010–11, and absolute change* since 1971, all adults (15–99 years), England and Wales: 1, 5, and 10 years after diagnosis *The absolute change is the simple arithmetic diﬀ erence between net survival in 2010–11 and the survival in 1971–72. NHL=non-Hodgkin lymphoma. England Wales Testis Melanoma Prostate Melanoma Colon Larynx Hodgkin’s Rectum Stomach NHL Bladder Testis Pancreas Oesophagus Leukaemia Kidney Myeloma Lung Other cancers Prostate Oesophagus Larynx Colon Rectum Stomach Lung NHL Kidney Bladder Myeloma Pancreas Leukaemia Other cancers Hodgkin’s Brain Brain Prostate 1 1 0 2 –20 Testis Colon Oesophagus Pancreas Stomach Lung Melanoma Larynx Rectum Kidney Leukaemia NHL Bladder Prostate Other cancers Myeloma Brain Hodgkin’s 0 –10 1 1 0 Pancreas Larynx Rectum Hodgkin’s NHL Bladder Colon Melanoma Oesophagus Testis Stomach Lung Kidney Leukaemia Myeloma Other cancers Brain Colon Oesophagus Stomach Pancreas Rectum Prostate Melanoma Lung Larynx Testis Brain Kidney Other cancers Hodgkin’s Bladder NHL Testis Pancreas Oesophagus Stomach Lung Colon Rectum Melanoma Larynx 1 1 0 2 –20 Prostate –30 –40 Leukaemia Kidney NHL Brain Bladder Myeloma –50 –60 –70 –80 Other cancers Myeloma Leukaemia Hodgkin’s –50 –40 –30 –20 –10 0 10 20 30 40 –50 –40 –30 –20 –10 0 10 20 30 40 Increase in age gap Decrease in age gap Increase in age gap Decrease in age gap Absolute change in age gap since 1971 (%) Absolute change in age gap since 1971 (%) 2010–11 against the absolute change since 1971–72: it shows a negative gap in survival for most cancers (y-axis of ﬁ gures 3 and 4). The largest age gaps in survival in men were observed for cancers for which high-dose chemotherapy is the key treatment (lymphoma, multiple myeloma, and leukaemia), but we could not identify any overall temporal patterns. 
For women, the largest age gaps were noted for brain tumours, and cancers of the ovary and cervix, and multiple myeloma, but the clustering was less obvious than in men. The age gap tended to narrow for melanoma and cancer of the uterus in women but widened for long-term survival of ovarian cancer. Discussion The index of net survival for all cancers combined has increased substantially: for patients diagnosed in 1971–72, the index was 50% at 1 year after diagnosis. Our prediction is that, for patients diagnosed during 2010–11, the all-cancers survival index will reach 50% at 10 years after diagnosis. Very similar patterns of change and levels of survival were noted in both England and Wales. Survival has increased steadily during the 40 years since 1971, with a slight acceleration in the past 10–15 years, particularly for 5-year and 10-year survival, in both England and Wales. After implementation of the NHS cancer plan for England,19 we reported a slight acceleration in the 1-year cancer survival trends during 2004–06, by contrast with Wales,2 where a national cancer plan was only introduced in 2006. The pattern was not so clear for survival at 3 years after diagnosis. The ﬁ ndings reported here suggest a continuing acceleration of these trends for longer-term survival between 2005–06 and 2010–11 in England, but also in Wales (panel). The completeness and quality of cancer registration and follow-up data in both England and Wales have been systematically assessed and are thought to be very high throughout the period 1971–2011, despite undeniable improvement during the 1970s–80s.21–23 This improvement cannot explain long-term trends in cancer survival.24,25 Furthermore, with the exception of bladder cancer, overall changes in disease deﬁ nitions are limited, even for haemopoietic malignancies. 
To aﬀ ect the survival index, such a change in disease deﬁ nition would need to aﬀ ect a substantial proportion of all cancers, for which prognosis would also need to be very diﬀ erent from that for other cancers. These conditions are not met. In some strata deﬁ ned by age, sex, cancer, and calendar period of diagnosis, especially in Wales, few deaths Figure 3: Age gap* in net survival by cancer, men (15–99 years) diagnosed during 2010–11 versus absolute change† in the age gap since 1971, England and Wales: 1, 5, and 10 years after diagnosis *The age gap represents the absolute diﬀ erence (%) between net survival in the oldest (75–99 years) and youngest (15–45 years) groups of patients; a negative value means that survival is lower in the oldest group than the youngest group. †The absolute change is the simple arithmetic diﬀ erence between the age gap in 2010–11 and the age gap in 1971–72. NHL=non-Hodgkin lymphoma. 1214 England Wales Melanoma Breast Melanoma Breast Uterus Rectum Hodgkin’s Myeloma Leukaemia Stomach Lung Colon NHL Bladder Kidney Oesophagus Pancreas Other cancers Cervix Ovary Stomach Oesophagus Pancreas Myeloma Lung Bladder Kidney Other cancers Uterus Hodgkin’s Leukaemia Rectum Colon NHL Cervix Ovary Brain Brain Breast Colon Melanoma Pancreas Stomach Oesophagus Lung Uterus Rectum Leukaemia NHL Kidney Bladder Hodgkin’s Myeloma Other cancers Brain Ovary Cervix Pancreas Breast Melanoma Hodgkin’s Uterus Leukaemia Colon Rectum Oesophagus Lung Stomach Myeloma NHL Kidney Bladder Cervix Ovary Other cancers Brain Colon Melanoma Rectum Uterus Pancreas Breast Lung Stomach Oesophagus Pancreas Breast Melanoma Colon Rectum Lung Oesophagus Stomach Hodgkin’s Leukaemia Uterus –30 –40 –50 Bladder Leukaemia NHL Kidney Brain Other cancers Hodgkin’s –60 Myeloma Ovary Cervix Myeloma Bladder NHL Ovary Brain Kidney Cervix Other cancers –70 –80 –50 –40 –30 –20 –10 0 10 20 30 40 –50 –40 –30 –20 –10 0 10 20 30 40 Increase in age gap Decrease in age gap Increase in age gap 
Decrease in age gap Absolute change in age gap since 1971 (%) Absolute change in age gap since 1971 (%) 0 occurred. To obtain more stable net survival estimates, we therefore estimated net survival using a modelling approach rather than the non-parametric Pohar-Perme approach.20 The index of net survival for all cancers combined provides one convenient number that summarises the overall patterns of cancer survival in any one population or country, in each calendar period for young and old men and women and for a wide range of cancers with very disparate survival. The index is unaﬀ ected by changes in the proportion of cancers of diﬀ erent lethality in either sex, such as the reduction of lung cancer or the increase in prostate cancer in men. Similarly, the index is unaﬀ ected by ageing of the population of patients with cancer or shifts in the proportion of any cancer between men and women. The value of the index changes only when survival for one or more cancers changes, for one or more age groups. The index therefore shows overall progress in cancer management, whether from earlier diagnosis, or earlier stage of disease, or improved treatment and care. However, the all-cancers survival index needs careful interpretation: for example, the predicted value of 50% for the 10-year all-cancers survival index for 2010–11 does not mean that half of all patients will be cured or “beat cancer”, as has been portrayed in the media.26 The index is designed as a public health measure that summarises cancer survival trends in an entire population, to help to assess progress in the overall eﬀ ectiveness of the health system in diagnosis and management of patients with cancer. The index does not reﬂ ect the prospects of survival for any individual patients with cancer. The index is based on net survival, which is an unbiased measure of population- based survival from cancer after adjustment for other causes of death. 
Net survival is the most valid available metric for comparison of survival between populations and for assessment of progress in cancer survival over time. The all-cancers net survival index should nevertheless be interpreted in conjunction with other information available in the population or country for which the index has been prepared. It should be seen as a guide to raise questions about the potential for improvement. The average 10% diﬀ erence in the survival index between men and women has been a consistent feature for 40 years. It arises because, for several individual cancers, survival is slightly higher for women, but mostly because the cancers that are most common in women, such as breast cancer (weight of 0·31 in the survival index for women), generally have higher survival than the cancers that are most common in men, such as lung Figure 4: Age gap* in net survival by cancer, women (15–99 years) diagnosed during 2010–11 versus absolute change† (%) in the age gap since 1971, England and Wales: 1, 5, and 10 years after diagnosis *The age gap represents the absolute diﬀ erence (%) between net survival in the oldest (75–99 years) and youngest (15–45 years) groups of patients; a negative value means that survival is lower in the oldest group than the youngest group. †The absolute change is the simple arithmetic diﬀ erence between the age gap in 2010–11 and the age gap in 1971–72. NHL=non-Hodgkin lymphoma. 1215 See Online for appendix Panel: Research in context Systematic review Health policy measures to improve the organisation and delivery of services for the prevention, diagnosis, and treatment of cancer should be based on sound evidence. Population-based survival trends have proved to be a key metric for the overall eﬀ ectiveness of health systems. 
An unbiased estimator of net survival was introduced in 2012.20 We have not undertaken a literature review, but so far, only a few countries have published population-based cancer survival using this estimator, including in England by our research group.12 No other country has constructed a single, summary index of net survival for all cancers combined. A simple, robust, one-number index of net survival for all cancers combined can contribute to the evidence base for rational health policy. Interpretation Changes in the net survival index reﬂ ect changes in survival for one or more cancers, not simply changes in the distribution of cancer patients by age, cancer site, or sex. The net survival index increased substantially between 1971 and 2011, representing a substantial gain in overall survival from all cancers combined. Net survival varied widely for diﬀ erent cancers, and was generally lower for older patients than younger patients, even after adjustment for the higher mortality from other causes in older patients. Three clusters of cancers, with high, moderate, and low survival, can be distinguished as early as 1 year after diagnosis. Overall, the survival trends are encouraging in both England and Wales, but they also suggest strongly the need for renewed eﬀ orts to achieve better outcomes. cancer (weight of 0·22 in the index for men). The slight narrowing in the sex gap observed in the most recent periods might be explained by the rapid increase in survival for prostate cancer (weight of 0·19 in the index for men), particularly at 5 and 10 years after diagnosis. This rapid increase in survival for prostate cancer has been largely attributed to the widespread use of prostate-speciﬁ c antigen (PSA) testing, resulting in the diagnosis of many less advanced tumours with a shift of the stage distribution to less advanced and less aggressive disease. 
However, importantly, survival had already started to increase, albeit more slowly, much before PSA testing was widely used.27 The more recent increase in long-term survival suggests that this improvement is not simply because of a shift in the stage distribution after increasingly wide use of the PSA test. The increase in short-term survival, which began as early as the 1970s, and the increase in 5-year survival in the 1980s and then in the 10-year survival in the following decade cannot simply be attributed to PSA. We were able to group the 21 most common cancers into three clusters on the basis of their survival. Despite some large gains in survival, these clusters are, with few exceptions, the same in 2011 as in 1971 (data not shown). The clusters are identiﬁ able as early as 1 year after diagnosis, and they are consistent at 5 and 10 years after diagnosis, both in England and Wales. Cluster 1 includes cancers with a good prognosis: survival is now very high, after a large increase since 1971, particularly at 5 and 10 years after diagnosis. 1-year survival seems to have reached a ceiling for most of these cancers, but survival at 5 and 10 years is still much lower than at 1 year for breast cancer and Hodgkin’s disease. The absence of any plateau in survival, even 10 years after diagnosis, shows that cure at the population level has still not been reached for these cancers, leaving room for substantial further improvement in long-term survival. For most cancers in the other two clusters, survival at 5 and 10 years after diagnosis is still much lower than 1-year survival. The second cluster consists of a further mix of cancers for which either survival has remained moderate since the early 1970s, or moderate levels of survival in 2011 are the result of large improvements during the past 40 years. 
The second situation is well illustrated by the steep increase in survival from multiple myeloma since 2000–01, probably explained by the introduction of higher-dose treatment regimens around 2000. For the cancers in this cluster that have shown no evidence of improvement, eﬀ orts should be made to achieve earlier diagnosis, and to focus on stricter guidelines for improved treatment, such as increased use of surgery, radiotherapy with curative intent, neoadjuvant therapies, or a combination of the three. The eﬀ ect of mass-screening on survival varies with the cancer. For cervical cancer, an eﬃ cient screening programme does not necessarily lead to an improvement in survival because screening prevents the occurrence of invasive tumours, thereby reducing incidence, and the remaining patients are, on average, diagnosed with more advanced disease.28 A quasi-plateau in 1-year survival has been observed since 2000–01 (appendix 1 and 2). By contrast, breast cancer screening aims to diagnose the disease at an early stage, rather than to prevent it. Its real eﬀ ect on survival has been questioned mainly because of possible overdiagnosis and lead time. However, overdiagnosis does not exceed a few percent,29 and the advantage in survival remains important for screen- detected breast cancer after accounting for lead time.30 Improvement in breast cancer survival has been large because of both early diagnosis and improved treatment, although net survival continues to decrease even 10 years after diagnosis, showing late recurrences. The age gap in survival has also decreased, supporting more rapid improvement in survival for older women (and for the screened age group) than in younger women.31 to prevent invasive malignant Screening for colorectal cancer, which started in 2006, aims tumours (by removing polyps with adenoma tous change) and to diagnose cancer at an early stage. 
Therefore, although it is too recent to have any eﬀ ect on these results, lessons from both cervical and breast cancer screening 1216 programmes will also help us to monitor the eﬀ ect of screening on the prognosis of colorectal cancer. in young patients A wide age gap in survival was still present for most cancers in 2010–11. Some of these diﬀ erences are related to screening or early diagnostic practices (breast, cervix, prostate). Also, the disease, and its prognosis, might radically diﬀ er by age, such as leukaemia: the treatment of acute disease improved substantially, by contrast with chronic leukaemia in elderly patients, but separation of both diseases is not possible over the entire period 1971–2011. However, in other countries, the age gap in cancer survival is much narrower than in England and Wales.32,33 The wide age- related inequalities in cancer survival in England and Wales are thus likely to be avoidable. They could be substantially reduced. 1-year survival has improved substantially for cancers with a particularly poor prognosis (cluster 3), but longer- term survival (5 and 10 years after diagnosis) has hardly changed during the past four decades. Among these cancers, substantial improvements should be achievable for lung cancer: in 2011, National Institute for Health and Care Excellence (NICE) guidelines34 underlined the need for improved staging and increased widespread access to surgery and radiotherapy with curative intent for non- small-cell lung cancer. Adherence to these guidelines and their eﬀ ect on cancer outcomes has not yet been exhaustively assessed.35 In summary, despite impressive overall improvements in cancer survival during the past 40 years in both England and Wales, the wide and persistent diﬀ erences in survival between cancers, together with the wide and persistent age gap in survival for most cancers, suggest the need for renewed eﬀ orts to achieve improved outcomes, particularly in elderly patients. 
The ﬁ ndings reported here oﬀ er clues for focused research to dissect the underlying causes of these diﬀ erences in cancer survival. The results should prompt action to improve public health in both England and Wales. This research will need systematic linkage of clinical audit streams and other detailed data streams to population-based cancer registry data, but the recent crisis of public concern about the sharing of individual health data for conﬁ dential public health research will need to be resolved ﬁ rst.36 Contributors MQ did the analysis. MQ and BR designed the analytic strategies and constructed the indexes. MQ, MPC, and BR wrote the Article and interpreted the ﬁndings.

kbenoit · 2019-02-11T04:27:53Z

Thanks - that will of course read the ligatures because they are part of your text file. readtext does not convert ligatures if the .txt file contains them.

But if you are converting from a pdf file, then we can help. My own tests showed that ligatures are correctly processed. Did you read the text above using readtext() where the source file was a pdf?
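To confirm that the ligatures really are in the source file rather than introduced on import, the lines can be scanned for the Latin ligature code points U+FB00 to U+FB06. This is only a minimal sketch using stringi: the `lines` vector is a hypothetical stand-in for what `readLines("H0101.txt", encoding = "UTF-8")` would return, and `lig_pattern` is an illustrative name.

```r
library(stringi)

# Hypothetical sample lines; in practice these would come from the .txt file
lines <- c("a signi\ufb01cant e\ufb00ect", "no ligatures here", "\ufb01eld")

# U+FB00..U+FB06 covers the Latin ligatures (ff, fi, fl, ffi, ffl, long s-t, st)
lig_pattern <- "[\\uFB00-\\uFB06]"

# Number of ligature characters on each line
n_lig <- stri_count_regex(lines, lig_pattern)

# Indices of the lines that contain at least one ligature
which(n_lig > 0)
```

A non-zero count means the characters are part of the file itself, so changing the `encoding` argument will not remove them.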

MSU2580 · 2019-02-11T04:35:32Z

Hello, Specifying the UTF-8 encoding did help clear up many of the other issues we had when reading in the .txt file. However, I will see if we have access to the pdf files and go from there. I did utilize your work on the example pdf document and noted that it was effective, which may be the route we have to take. Thank you again. Best,

…

____________________________________________________ Keith L. Johnson, Ph. D. Montana State University Physics Department, Barnard Hall 226 PER Group


kbenoit · 2019-02-11T05:06:57Z

One option for you is "Unicode normalization", which will convert the ligatures. (But I am not sure why there is a space after each ligature in your text above; this would need to be dealt with separately, if it's actually part of your file.)

txt <- "substantial investments in human and ﬁnancial 
resources for the eﬀect of earlier diagnosis of waﬄes."

cat(stringi::stri_trans_nfkc(txt))
## substantial investments in human and financial 
## resources for the effect of earlier diagnosis of waffles.
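The stray space after each ligature could be handled before normalizing. The following is only a heuristic sketch, not something readtext does: `fix_ligatures` is a hypothetical helper, and it assumes that a single space between a ligature and a following lowercase letter is an extraction artifact.

```r
library(stringi)

# Hypothetical helper: remove the artifact space, then normalize
fix_ligatures <- function(x) {
  # Step 1 (heuristic): drop one space that directly follows a ligature
  # (U+FB00..U+FB06) when the next character is a lowercase letter
  x <- stri_replace_all_regex(x, "([\\uFB00-\\uFB06]) (?=\\p{Ll})", "$1")
  # Step 2: NFKC normalization expands each ligature into plain letters
  stri_trans_nfkc(x)
}

txt <- "The average absolute di\ufb00 erence (\ufb01 gure 1) was identi\ufb01 ed."
cat(fix_ligatures(txt))
```

The heuristic cannot tell an artifact space from a real word boundary, so a word that genuinely ends in a ligature (for example "cutoﬀ for") would be merged with the next word. It is worth checking the results against a dictionary or a sample of the corpus before relying on it.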

MSU2580 · 2019-02-25T03:00:17Z

Dr. Benoit,

Since last we communicated I have come across some pdf files (one example attached) which also seem to contain ligatures pertaining to "fi", "fl", etc. (as well as symbols such as lambda, which do not concern us). I have again attached all the relevant data that you requested before. If you would like me to open up a separate case on GitHub, let me know. Thank you for all your insight!

The sessionInfo() data:

R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] readtext_0.71 quanteda_1.3.14

loaded via a namespace (and not attached):
[1] Rcpp_0.12.16 magrittr_1.5 stopwords_0.9.0 munsell_0.4.3 colorspace_1.3-2 lattice_0.20-35 R6_2.2.2 rlang_0.2.0
[9] fastmatch_1.1-0 stringr_1.3.0 httr_1.3.1 plyr_1.8.4 pdftools_2.1 tools_3.4.4 grid_3.4.4 data.table_1.10.4-3
[17] gtable_0.2.0 spacyr_1.0 yaml_2.1.18 lazyeval_0.2.1 RcppParallel_4.4.2 tibble_1.4.2 Matrix_1.2-12 ggplot2_2.2.1
[25] ISOcodes_2018.06.29 stringi_1.1.7 compiler_3.4.4 pillar_1.2.1 scales_0.5.0 lubridate_1.7.4

The code I utilized (as it pertains to just the attached file):

library(quanteda)
library(readtext)
tempsci <- readtext("hep-th9910196.pdf")
sciCorp <- corpus(tempsci)
doc_term_matrix <- dfm(sciCorp, remove = stopwords("english"), remove_punct = TRUE, remove_numbers = TRUE, stem = FALSE)
FinalSciCorp = textstat_frequency(doc_term_matrix)

Best,


kbenoit · 2019-02-25T03:01:44Z

Thanks but please either upload them by dragging into the GitHub browser, or send them to me by email. (They did not show up above.)

MSU2580 · 2019-02-25T03:04:44Z

hep-th9910196.pdf

kbenoit · 2019-02-25T03:23:00Z

Thanks. That definitely contains the ligatures, and readtext::readtext() definitely does not normalize them. We can think about adding an option to readtext to do this automatically, but in the meantime, you can solve this "manually" using stringi:

library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.0
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
rtxt <- readtext::readtext("~/Downloads/hep-th9910196.pdf")

texts(rtxt) %>%
  kwic(c("in*nity", "*erential", "coe*cients", "*uctuation", "cuto*"), 1)
##                                                        
##  [hep-th9910196.pdf, 1523] the |  infinity   | so      
##  [hep-th9910196.pdf, 1614] the | ﬂuctuation  | (       
##  [hep-th9910196.pdf, 1812] the | diﬀerential | equation
##  [hep-th9910196.pdf, 1826] the | coeﬃcients  | of      
##  [hep-th9910196.pdf, 1987] the |   inﬁnity   | ,       
##  [hep-th9910196.pdf, 2928] the |   inﬁnity   | r       
##  [hep-th9910196.pdf, 3070] the |   inﬁnity   | ,       
##  [hep-th9910196.pdf, 3760] are |    cutoﬀ    | for

texts(rtxt) %>%
  stringi::stri_trans_nfkc() %>%
  kwic(c("in*nity", "*erential", "coe*cients", "*uctuation", "cuto*"), 1)
##                                             
##  [text1, 1523] the |   infinity   | so      
##  [text1, 1614] the | fluctuation  | (       
##  [text1, 1812] the | differential | equation
##  [text1, 1826] the | coefficients | of      
##  [text1, 1987] the |   infinity   | ,       
##  [text1, 2928] the |   infinity   | r       
##  [text1, 3070] the |   infinity   | ,       
##  [text1, 3760] are |    cutoff    | for
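Note that the document name in the second output became text1 because texts() returns a plain character vector. One way to keep the document names might be to normalize the text column of the readtext result in place before building the corpus. The sketch below assumes only that readtext() returns a data.frame with `doc_id` and `text` columns. To stay runnable without the PDF it uses a hypothetical stand-in data.frame and counts words with stringi instead of dfm().

```r
library(stringi)

# Hypothetical stand-in for the data.frame returned by readtext()
rtxt <- data.frame(
  doc_id = c("a.pdf", "b.pdf"),
  text = c("the \ufb01eld is in\ufb01nite", "the field coe\ufb03cients"),
  stringsAsFactors = FALSE
)

# Normalize once, in place, so every later step sees plain letters
rtxt$text <- stri_trans_nfkc(rtxt$text)

# Simple word counts; "field" and its ligature spelling now fall together
words <- unlist(stri_extract_all_words(rtxt$text))
freq <- table(words)
freq
```

After this step the data.frame should be usable with corpus() in the same way as before, with `doc_id` left untouched.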

kbenoit mentioned this issue Mar 11, 2019

Add Unicode normalization quanteda/quanteda#1641

Merged
