readtext issue: ligature artifacts remain, even after specifying UTF-8 encoding. #146
Can you send me a link to one of the documents containing a ligature, as well as your sessionInfo() output, so I can test it?
Hello,
I apologize if you prefer another format (let me know), but the following should contain the information you desire. Note that attached is one of the documents which contains ligature issues.
The sessionInfo() data:
sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] readtext_0.71 quanteda_1.3.14
loaded via a namespace (and not attached):
[1] Rcpp_0.12.16 magrittr_1.5 stopwords_0.9.0 munsell_0.4.3 colorspace_1.3-2 lattice_0.20-35 R6_2.2.2 rlang_0.2.0
[9] fastmatch_1.1-0 stringr_1.3.0 httr_1.3.1 plyr_1.8.4 tools_3.4.4 grid_3.4.4 data.table_1.10.4-3 gtable_0.2.0
[17] spacyr_1.0 yaml_2.1.18 lazyeval_0.2.1 RcppParallel_4.4.2 tibble_1.4.2 Matrix_1.2-12 ggplot2_2.2.1 ISOcodes_2018.06.29
[25] stringi_1.1.7 compiler_3.4.4 pillar_1.2.1 scales_0.5.0 lubridate_1.7.4
The code I utilized (as it pertains to just the attached file):
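library(quanteda)
library(readtext)
# Read the attached file, declaring its encoding as UTF-8
tempsci <- readtext("H0101.txt", encoding = "UTF-8")
sciCorp <- corpus(tempsci)
# Build a document-feature matrix without stopwords, punctuation, or numbers
doc_term_matrix <- dfm(sciCorp, remove = stopwords("english"), remove_punct = TRUE, remove_numbers = TRUE, stem = FALSE)
# Tabulate feature frequencies
FinalSciCorp <- textstat_frequency(doc_term_matrix)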
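One possible workaround, offered as a sketch rather than a confirmed fix for this issue: if the file contains true Unicode ligature characters (for example U+FB00 or U+FB01), NFKC normalization replaces them with plain letters. The helper name normalize_ligatures is hypothetical and not part of readtext or quanteda; stri_trans_nfkc() comes from the stringi package, which is listed in the session info above. Normalization will not remove a stray space that a PDF-to-text conversion may have left after a ligature, and NFKC also rewrites other compatibility characters (such as superscript digits), so the result should be checked.

```r
library(stringi)

# Hypothetical helper (not part of readtext or quanteda): apply Unicode
# NFKC normalization, which maps compatibility characters such as the
# ligatures U+FB00..U+FB04 to their plain-letter equivalents.
normalize_ligatures <- function(x) {
  stri_trans_nfkc(x)
}

# Two made-up sample strings containing the "ff" and "fi" ligatures.
sample_text <- c("overall e\ufb00ectiveness", "\ufb01nancial resources")
print(normalize_ligatures(sample_text))
```

Applied to the code above, this would be tempsci$text <- normalize_ligatures(tempsci$text) before calling corpus(), assuming the text sits in the text column of the readtext data frame.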
Thank you for your time, it is most appreciated!
Best,
…____________________________________________________
Keith L. Johnson, Ph. D.
Montana State University
Physics Department, Barnard Hall 226
PER Group
________________________________
From: Kenneth Benoit <notifications@github.com>
Sent: Sunday, February 10, 2019 3:55:53 PM
To: quanteda/readtext
Cc: Johnson, Keith; Author
Subject: Re: [quanteda/readtext] readtext issue: ligature artifacts remain, even after specifying UTF-8 encoding. (#146)
Can you send me a link to one of the documents containing a ligature, as well as your sessionInfo() output, so I can test it?
40-year trends in an index of survival for all cancers combined and survival adjusted for age and sex for each cancer in England and Wales, 1971–2011: a population-based study
Summary
Background Assessment of progress in cancer control at the population level is increasingly important. Population-based survival trends provide a key insight into the overall effectiveness of the health system, alongside trends in incidence and mortality. For this purpose, we aimed to provide a unique measure of cancer survival.
Methods In this observational study, we analysed trends in survival with population-based data for 7·2 million adults diagnosed with a first, primary, invasive malignancy in England and Wales during 1971–2011 and followed up to the end of 2012. We constructed a survival index for all cancers combined using data from the National Cancer Registry and the Welsh Cancer Intelligence and Surveillance Unit. The index is designed to be independent of changes in the age distribution of patients with cancer and of changes in the proportion of lethal cancers in each sex. We analysed trends in the cancer survival index at 1, 5, and 10 years after diagnosis for the selected periods 1971–72, 1980–81, 1990–91, 2000–01, 2005–06, and 2010–11. We also estimated trends in age-sex-adjusted survival for each cancer. We define the difference in net survival between the oldest (75–99 years) and youngest (15–44 years) patients as the age gap in survival. We evaluated the absolute change (%) in the age gap since 1971.
Findings The overall index of net survival increased substantially during the 40-year period 1971–2011, both in England and in Wales. For patients diagnosed in 1971–72, the index of net survival was 50% at 1 year after diagnosis. 40 years later, the same value of 50% was predicted at 10 years after diagnosis. The average 10% survival advantage for women persisted throughout this period. Predicted 10-year net survival adjusted for age and sex for patients diagnosed between 2010 and 2011 ranged from 1·1% for pancreatic cancer to 98·2% for testicular cancer. Net survival for the oldest patients (75–99 years) was persistently lower than for the youngest (15–44 years), even after adjustment for the much higher mortality from causes other than cancer in elderly people.
Interpretation These findings support substantial increases in both short-term and long-term net survival from all cancers combined in both England and Wales. The net survival index provides a convenient, single number that summarises the overall patterns of cancer survival in any one population, in each calendar period, for young and old men and women and for a wide range of cancers with very disparate survival. The persistent sex difference is partly due to a more favourable cancer distribution in women than men. The very wide differences in survival for different cancers, and the persistent age gap in survival, suggest the need for renewed efforts to improve cancer outcomes. Future monitoring of the cancer survival index will not be possible unless the current crisis of public concern about sharing of individual data for public health research can be resolved.
Introduction
Cancer is an increasing public health concern, shown by
substantial investments in human and financial
resources for cancer management since the late 1990s.
Health policy measures have focused on improvement of
the organisation and delivery of services for prevention,
diagnosis, and treatment. Research has provided the
evidence base for these policies and is increasingly used
to assess their effect.1–7 The assessment of progress in
cancer control has become crucial. Population-based
cancer survival trends provide a key insight into the
overall effectiveness of the health system, alongside
incidence and mortality.8
In this population-based survival study, we analysed
cancer survival trends during the past four decades in
England and Wales using two metrics: an index of
survival for all cancers combined, and survival for each
cancer, adjusted for age and sex. The all-cancers survival
index was designed to provide one summary measure
of cancer survival that can be monitored over time to
show the overall progress in the effectiveness of the
health-care system. It was also designed to support
assessment of the effect of earlier diagnosis, which is a
key component of the National Awareness and Early
Diagnosis Initiative.9–11 Trends in survival for individual
cancers will underline those cancer types for which
there has been progress and those for which prognosis
has remained poor.
Methods
Study design
Survival varies very widely with the age and sex of a
patient with cancer and with the type of cancer. The
frequency of different cancers is also changing over time:
some cancers with poor prognosis, such as stomach and
lung cancer, have become less common, whereas breast
cancer in women, for which survival has been improving,
has become more common. These trends can differ
between the sexes: lung cancer has become much less
common in men, but more common in women. The age
profile of patients with cancer also changes over time,
and these trends can differ between cancers. To enable
valid assessment of survival trends for all cancers
combined, the survival index must therefore take account
of changes over time in the distribution of age, sex, and
cancer type in all patients with cancer, especially over
periods as long as 40 years. Similarly, trends in survival
for each cancer must be adjusted for changes over time
in the age (and sex) profile of patients with cancer.
Data sources
We examined survival trends in 7 176 795 adults (aged
15–99 years) diagnosed with a first, primary, invasive
malignancy in England and Wales during 1971–2011, and
followed up to Dec 31, 2012 (table 1). Data for England
were obtained from the National Cancer Registry at the
Office for National Statistics12 and for Wales from the
Welsh Cancer Intelligence and Surveillance Unit. Patients
diagnosed with a malignancy of the skin other than
melanoma were excluded. Since 1971, the National Health
Service Central Register has routinely updated these
individual cancer records with information about each
patient’s vital status (alive, emigrated, dead, or not traced).
The vital status at Dec 31, 2012, was known for 98·4% of
these patients. During the 41-year period, 4·3% of all
cancer registrations were for the patient’s second-order or
higher-order tumour: in the analyses for all cancers
combined, the higher-order cancers were not included.
Cancer | ICD-10 code* | England: women (number) | England: women (%) | England: men (number) | England: men (%) | Wales: women (number) | Wales: women (%) | Wales: men (number) | Wales: men (%)
Oesophagus | C15 | 67 474 | 2·0% | 106 793 | 3·1% | 4953 | 2·3% | 6857 | 3·1%
Stomach | C16 | 115 294 | 3·4% | 194 333 | 5·7% | 8627 | 4·0% | 14 299 | 6·5%
Colon | C18 | 292 352 | 8·7% | 271 220 | 8·0% | 17 711 | 8·3% | 17 736 | 8·1%
Rectum | C19–C21 | 143 610 | 4·3% | 204 363 | 6·0% | 9731 | 4·5% | 14 358 | 6·6%
Pancreas | C25 | 92 631 | 2·8% | 93 450 | 2·7% | 5868 | 2·7% | 6014 | 2·7%
Larynx (men) | C32 | ·· | ·· | 52 618 | 1·5% | ·· | ·· | 3529 | 1·6%
Lung | C33, C34 | 349 711 | 10·5% | 751 958 | 22·1% | 21 027 | 9·8% | 45 601 | 20·8%
Melanoma | C43 | 97 627 | 2·9% | 72 743 | 2·1% | 5429 | 2·5% | 4372 | 2·0%
Breast (women) | C50 | 1 039 609 | 31·1% | ·· | ·· | 65 370 | 30·6% | ·· | ··
Cervix | C53 | 117 404 | 3·5% | ·· | ·· | 8272 | 3·9% | ·· | ··
Uterus | C54, C55 | 160 539 | 4·8% | ·· | ·· | 10 836 | 5·1% | ·· | ··
Ovary | C56, C57.0–7 | 172 400 | 5·2% | ·· | ·· | 11 051 | 5·2% | ·· | ··
Prostate | C61 | ·· | ·· | 638 111 | 18·8% | ·· | ·· | 41 559 | 19·0%
Testis | C62 | ·· | ·· | 48 031 | 1·4% | ·· | ·· | 2743 | 1·3%
Kidney | C64–C66, C68 | 53 197 | 1·6% | 89 986 | 2·6% | 3431 | 1·6% | 5804 | 2·6%
Bladder | C67 | 90 204 | 2·7% | 239 621 | 7·0% | 5897 | 2·8% | 15 962 | 7·3%
Brain | C71 | 41 952 | 1·3% | 59 192 | 1·7% | 2832 | 1·3% | 3786 | 1·7%
Hodgkin’s disease | C81 | 19 114 | 0·6% | 26 714 | 0·8% | 1145 | 0·5% | 1675 | 0·8%
Non-Hodgkin lymphoma | C82–C85 | 99 752 | 3·0% | 114 269 | 3·4% | 5630 | 2·6% | 6320 | 2·9%
Myeloma | C90 | 43 446 | 1·3% | 48 136 | 1·4% | 2805 | 1·3% | 3041 | 1·4%
Leukaemia | C91–C95 | 70 760 | 2·1% | 92 917 | 2·7% | 4686 | 2·2% | 6112 | 2·8%
Other cancers† | ·· | 275 408 | 8·2% | 296 794 | 8·7% | 18 624 | 8·7% | 19 369 | 8·8%
Total | ·· | 3 342 484 | 100·0% | 3 401 249 | 100·0% | 213 925 | 100·0% | 219 137 | 100·0%
*Tenth revision of the International Classification of Diseases (ICD): malignancies were initially coded according to the ICD revision in use during the year of diagnosis, ie, ICD 8 (1971–78), 9 (1979–95), or 10 (1996–). †Other cancers: all other malignant tumours are combined; they also include laryngeal cancer in women and breast cancer in men.
Table 1: Number of patients (aged 15–99 years) included in analyses in England and Wales diagnosed from 1971 to 2011 and followed up to 2012, by sex and type of malignancy
Statistical analysis
The all-cancers survival index was constructed as a
weighted average of the survival estimates for every
combination of age group at diagnosis (15–44, 45–54,
55–64, 65–74, and 75–99 years), sex (male and female), and
type of cancer (the 21 most common malignancies are
shown in table 1 and all other malignant tumours are
combined). The weights used were the proportion of
patients with cancer diagnosed in England and Wales
during 1996–99 in each of the 185 combinations of age
group, sex, and type of cancer. We also constructed the
all-cancers survival index separately for males and females
and estimated survival adjusted for age and sex by cancer.
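The weighted average described in this paragraph can be written compactly. The notation below is an assumption introduced for illustration and does not appear in the article: k indexes the 185 combinations of age group, sex, and type of cancer, S_k(t) is the estimated net survival at t years after diagnosis in combination k, and w_k is the proportion of patients diagnosed during 1996–99 who fall in that combination.

```latex
% Sketch only: symbols I, w_k and S_k are illustrative, not the article's notation.
\[
  I(t) \;=\; \sum_{k=1}^{185} w_k \, S_k(t),
  \qquad
  \sum_{k=1}^{185} w_k = 1 .
\]
```

Because the weights are fixed at the 1996–99 distribution, a change in I(t) between calendar periods reflects changes in the S_k(t) rather than shifts in the mix of age, sex, and cancer type.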
Figure 1: Trends in the index of net survival for all cancers combined, for England and for Wales: all adults (15–99 years), men, and women, selected periods during 1971–2011
[Panels: All adults, England; All adults, Wales; Men, England; Women, England; Men, Wales; Women, Wales. Series: 1-year survival, 5-year survival, 10-year survival, and prediction. x-axis: Period of diagnosis (year).]
Net survival was used as the cancer survival measure
for each component of the indexes. Net survival
quantifies the survival after taking account of death from
other causes (background mortality). All patients were
allocated a deprivation category defined according to
their Lower Super Output Area (mean population about
1500) of residence at the time of cancer diagnosis. Life-
tables were used to take account of the wide variation in
background mortality by age, sex, deprivation, region,
and over time. For this study, separate life-tables were
created for England and Wales by single year of age, sex,
deprivation category, and (in England) region of
residence, for every calendar year between 1971 and
2012.13 National or regional life-tables were used for the
2·8% of patients diagnosed in England (2·6% in Wales)
who could not be assigned to a specific deprivation
category or (in England) region; almost all of these
patients were diagnosed in the 1970s (85% in England,
55% in Wales) or 1980s (14% England, 44% Wales).
We used flexible multivariable parametric excess
hazard models14,15 to estimate net survival up to 10 years
after diagnosis for each nation, and for each stratum
defined by cancer, sex, age group, and calendar period.
The models included age and year of diagnosis as main
effects, modelled on a continuous scale with restricted
cubic splines, to account for potential non-linear excess
(cancer-related) hazards. Interactions between age and
year of diagnosis, year of diagnosis and follow-up time,
and age and follow-up time were assessed to deal with
potential variation of the excess hazard with time since
diagnosis. The best-fitting models were chosen as those
with the smallest Akaike Information Criterion.16 Net
survival curves were estimated for each individual from
these models according to their age and year of diagnosis.
We obtained net survival estimates for each cancer and
sex by averaging of individual net survival curves, over all
ages and years of diagnosis within each age group and
calendar period. In view of the fact that the models
included the year of diagnosis as a continuous variable,
we were able to predict survival up to 10 years after
diagnosis, even for the patients diagnosed most recently
(ie, 2010–11). All models were fitted with the STATA
command stpm2 using STATA 13.1.17,18
We included all patients diagnosed during the
40 years from 1971 to 2011 in the models to estimate
survival trends, but we report estimates for each cancer
survival index at 1, 5, and 10 years after diagnosis only
for six selected periods of diagnosis: 1971–72, 1980–81,
1990–91, 2000–01, 2005–06, and 2010–11. We define the
difference in net survival between the oldest
(75–99 years) and youngest (15–44 years) groups as the
age gap in survival. We provide a simple summary of
changes in survival by age as the absolute change (%) in
the age gap since 1971. A negative value for this change
means that the age gap has become wider. For Wales,
reliable estimates of net survival could not be obtained
for 11·5% of the age-sex-cancer combinations because
1971–72
1980–81
1990–91
2000–01
2005–06
2010–11 (prediction)
1 year
5 years
10 years
1 year
5 years
10 years
1 year
5 years
10 years
1 year
5 years
10 years
1 year
5 years
10 years
1 year
5 years
10 years
53·3%
54·1%
52·2%
10·6%
10·2%
11·0%
15·0%
14·7%
15·6%
15·4%
15·3%
15·5%
41·5%
42·6%
40·4%
All cancers combined
50·1%
All patients
Men
44·7%
Women
55·5%
Oesophagus
All patients
Men
Women
Stomach
All patients
Men
Women
Colon
All patients
Men
Women
Rectum
All patients
Men
Women
Pancreas
All patients
Men
Women
Larynx
Men
Lung
All patients
Men
Women
Melanoma of skin
All patients
Men
Women
Breast
Women
Cervix
Women
Uterus
Women
Ovary
Women
Prostate
Men
Testis
Men
Kidney
All patients
Men
Women
81·6%
74·5%
86·7%
44·9%
45·4%
43·9%
43·7%
66·1%
81·9%
74·0%
75·6%
80·7%
16·0%
16·3%
15·4%
83·3%
29·8% 24·0%
25·2%
19·9%
27·9%
34·3%
55·8%
35·3%
50·6% 29·6%
61·0% 40·9%
28·8%
23·3%
34·1%
4·3%
4·0%
4·8%
5·2%
5·2%
5·3%
24·6%
25·3%
23·8%
24·2%
23·6%
25·0%
2·3%
2·4%
2·2%
3·5%
3·3%
3·9%
4·0%
4·0%
4·0%
22·8%
23·0%
22·6%
20·1%
19·1%
21·6%
1·2%
1·3%
1·1%
19·1%
18·5%
20·0%
20·6%
20·7%
20·5%
5·3%
4·8%
6·2%
8·2%
8·1%
8·4%
54·0% 34·2%
34·6%
55·2%
52·7%
33·8%
60·6%
61·4%
59·5%
12·1%
12·4%
11·9%
32·5%
32·0%
33·2%
2·8%
3·1%
2·4%
4·3%
3·8%
5·0%
6·7%
6·7%
6·8%
31·8%
31·5%
32·1%
28·2%
27·1%
29·6%
1·5%
1·7%
1·2%
60·6%
55·7%
65·3%
24·2%
24·1%
24·3%
26·8%
27·0%
26·5%
62·1%
63·5%
60·7%
6·5%
6·1%
7·1%
10·9%
10·6%
11·5%
41·6%
41·9%
41·3%
67·8% 42·0%
68·7%
41·7%
66·6% 42·4%
13·0%
13·5%
12·5%
2·8%
3·2%
2·4%
41·0%
34·8%
47·2%
34·4%
28·0%
40·7%
64·9% 47·4% 41·6%
60·7% 42·0% 36·0%
47·0%
69·0% 52·7%
5·1%
4·8%
5·6%
8·9%
8·6%
9·4%
38·6%
38·1%
39·0%
37·7%
36·7%
39·0%
1·5%
1·7%
1·3%
31·1%
32·5%
28·8%
8·8%
9·1%
8·2%
33·9% 14·1%
34·7%
13·9%
32·4% 14·5%
7·0%
7·3%
6·5%
11·3%
11·0%
11·8%
47·5% 44·5%
66·7%
68·1%
47·6% 43·6%
65·4% 47·5% 45·4%
74·0% 51·2% 47·1%
74·8% 51·0% 46·4%
72·8% 51·4% 48·2%
14·7%
15·3%
14·0%
2·7%
3·0%
2·4%
1·2%
1·4%
1·1%
67·6%
63·7%
71·5%
36·4%
38·3%
33·4%
37·8%
39·3%
35·2%
70·3%
71·9%
68·6%
76·7%
77·5%
75·6%
17·4%
18·1%
16·7%
50·9% 45·8%
45·8% 41·0%
56·0% 50·5%
70·5% 54·3% 49·8%
66·7% 49·2% 45·7%
74·2% 59·2% 53·8%
11·5%
12·0%
10·8%
9·3%
9·4%
9·1%
42·0% 15·3% 12·4%
44·3% 15·6% 12·0%
38·6% 14·7% 13·1%
16·3% 13·1%
16·5% 13·0%
16·1% 13·1%
41·7% 18·8% 15·0%
43·8% 19·5% 15·3%
37·9% 17·7% 14·4%
52·6% 50·3%
52·9
49·4%
52·3% 51·1%
73·9% 58·2% 56·9%
76·1% 59·2% 56·5%
71·7% 57·3% 57·4%
55·5% 51·7%
55·4% 51·0%
55·7% 52·7%
79·2% 59·7% 56·1%
79·9% 59·6% 55·5%
78·1% 59·8% 57·0%
3·0%
3·2%
2·7%
1·2%
1·2%
1·2%
20·9%
21·7%
20·2%
3·3%
3·6%
3·1%
1·1%
1·1%
1·1%
60·2%
50·4%
81·7%
62·1%
52·6%
82·8% 64·1%
54·9%
83·7% 66·0% 57·0%
84·2%
67·0% 58·2%
84·7% 67·9% 59·2%
4·6%
4·8%
4·3%
3·1%
3·2%
2·9%
18·3%
18·6%
17·8%
5·5%
5·8%
5·0%
3·7%
3·9%
3·2%
20·5%
20·4%
20·7%
6·0%
6·1%
5·9%
3·8%
3·9%
3·7%
24·4%
23·9%
25·2%
6·9%
6·6%
7·4%
4·0%
3·7%
4·5%
52·3% 46·4%
34·9%
40·5%
61·1%
54·9%
88·7% 66·4% 60·4%
56·4% 49·8%
84·5%
91·8%
73·7%
68·3%
93·1%
77·2%
90·8% 69·8%
94·9% 82·6%
71·9%
63·4%
78·2%
95·5% 83·8% 79·7%
94·0% 78·4% 73·3%
96·6% 87·8% 84·5%
28·0%
27·0%
29·7%
96·4%
95·2%
97·3%
8·0%
7·4%
9·1%
4·4%
3·8%
5·4%
32·2%
30·5%
35·1%
9·6%
8·4%
11·6%
5·0%
4·0%
6·6%
87·0% 84·4%
82·6% 79·3%
90·2% 88·3%
97·4% 90·4% 89·8%
96·6% 87·8% 86·8%
97·9% 92·4% 92·1%
52·7%
40·1%
85·9% 61·2%
48·4%
89·5%
71·1%
60·0%
92·7% 80·2%
71·6%
94·5%
83·9% 75·6%
96·0% 86·7% 78·5%
51·3% 46·0%
78·6% 58·3%
52·4%
81·6% 62·6%
57·2%
82·8% 65·4% 60·7%
82·6% 66·3% 61·9%
82·9% 67·5% 63·1%
59·0%
55·5%
79·5%
65·1%
61·5%
83·3% 69·5%
65·6%
86·9% 73·1% 69·7%
88·7%
75·9% 73·3%
90·3% 78·8% 77·4%
20·5%
17·9%
50·2%
24·9%
21·5%
57·0%
30·8%
26·4%
64·7% 38·4% 31·7%
68·8%
42·4% 33·5%
72·7% 46·4% 34·8%
36·9%
25·1%
71·5%
38·2%
24·4%
79·6% 49·6%
34·1%
89·5%
73·8% 62·4%
92·4%
81·4% 75·1%
94·0% 84·8% 83·6%
70·5% 69·2%
91·2% 84·0%
83·3%
95·8% 92·3%
91·9%
98·0% 96·3% 96·2%
98·7%
97·5% 97·4%
99·1% 98·3% 98·2%
28·5%
28·9%
28·0%
23·0%
23·0%
23·1%
51·3%
52·6%
49·1%
34·1%
35·3%
32·2%
27·6%
28·5%
26·1%
57·1%
58·7%
54·4%
39·4%
40·8%
37·1%
32·3%
33·4%
30·5%
62·8% 44·8% 37·9%
63·9% 45·2%
37·8%
60·9% 44·0% 38·0%
67·2%
68·0%
65·9%
49·8% 43·0%
50·0% 42·9%
49·4% 43·2%
72·5% 56·3% 49·6%
73·2% 56·7% 50·0%
71·3% 55·6% 48·9%
7·2%
6·6%
7·9%
56·5%
54·2%
59·4%
29·9%
29·3%
30·6%
17·7%
17·6%
17·9%
75·6%
73·9%
77·8%
39·3%
60·2%
62·8% 40·9%
53·4%
35·2%
Bladder
All patients
Men
Women
Brain
All patients
Men
Women
Hodgkin’s disease
All patients
Men
Women
Non-Hodgkin lymphoma
All patients
Men
Women
Multiple myeloma
All patients
Men
Women
Leukaemia
All patients
Men
Women
Other cancers*
All patients
Men
Women
49·5%
49·4%
49·6%
11·8%
12·1%
11·4%
13·1%
13·1%
13·0%
37·4%
36·8%
38·0%
34·2%
35·4%
32·5%
55·3%
57·3%
53·0%
32·4%
33·7%
29·0%
5·4%
5·0%
6·0%
47·7%
45·2%
51·0%
22·0%
21·7%
22·3%
6·2%
6·8%
5·5%
6·9%
6·6%
7·2%
73·4% 56·0% 48·0%
76·0%
57·9% 49·3%
66·6% 50·8% 44·7%
52·8%
77·2% 60·8%
80·1%
54·2%
63·0%
69·6% 54·9% 49·0%
74·7% 56·4% 49·5%
78·5% 59·2%
52·0%
64·7% 49·1% 43·0%
23·3%
23·3%
23·3%
9·8%
9·2%
10·6%
7·2%
6·7%
7·8%
82·7% 66·8% 58·8%
65·1%
82·2%
56·5%
61·8%
83·3% 69·2%
27·7%
27·9%
27·4%
87·6%
87·5%
87·7%
11·8%
11·2%
12·7%
8·4%
7·9%
9·2%
30·4% 12·7%
30·9% 12·1%
29·8% 13·7%
8·8%
8·3%
9·5%
75·1%
69·2%
74·6% 68·7%
75·8% 69·9%
90·0% 80·3%
75·8%
89·7% 80·4% 75·8%
75·8%
90·3% 80·2%
58·8%
37·5%
58·6% 36·8%
59·0% 38·4%
28·1%
27·6%
28·8%
65·8% 44·9%
65·7%
44·2%
45·8%
66·0%
48·4%
47·8%
49·0%
47·3%
48·6%
45·6%
54·7%
54·3%
55·2%
17·2%
17·2%
17·1%
23·6%
23·7%
23·5%
36·5%
35·2%
37·9%
8·6%
9·0%
8·1%
14·9%
14·4%
15·6%
32·0%
30·7%
33·4%
57·4%
57·4%
57·3%
22·0%
22·2%
21·8%
57·8% 34·0%
59·4% 34·4%
55·8%
33·6%
35·2%
54·5%
52·6%
31·9%
56·6% 39·0%
35·2%
34·5%
35·9%
10·8%
11·1%
10·3%
24·0%
23·6%
24·6%
30·2%
26·9%
33·9%
70·1%
52·3% 43·9%
70·0% 51·6% 43·4%
53·2% 44·6%
70·2%
27·7%
64·5%
14·3%
65·7% 28·8% 15·1%
63·2% 26·4% 13·4%
63·8% 41·6% 32·3%
65·6% 42·4% 32·3%
61·4% 40·5%
32·2%
56·6% 37·1%
32·5%
55·0% 33·7% 29·2%
58·4% 41·0% 36·3%
73·5%
77·6%
63·0%
34·7%
35·3%
33·9%
90·8%
90·3%
91·4%
74·3%
74·4%
74·3%
70·6%
71·8%
69·3%
66·3%
68·3%
63·7%
59·7%
58·7%
60·9%
54·8% 49·2%
57·8% 52·4%
47·0% 40·9%
72·4% 53·4% 49·5%
76·6% 56·5% 53·5%
61·4% 45·3% 39·1%
15·0% 10·6%
14·2%
9·9%
16·1% 11·5%
40·1% 18·5% 13·5%
41·1%
17·8% 12·8%
38·8% 19·5% 14·5%
82·9% 78·3%
82·5% 77·2%
83·4% 79·7%
91·4% 85·0% 80·0%
90·8% 84·1%
77·7%
92·3% 86·3% 83·1%
59·7% 52·6%
59·1% 51·9%
60·5% 53·3%
79·6% 68·8% 63·1%
79·8% 68·1% 62·2%
79·4% 69·5% 64·1%
36·0% 21·4%
37·9% 23·5%
34·0% 19·2%
76·7% 47·0% 32·6%
78·0% 50·0% 36·8%
75·3% 43·8% 27·9%
46·4% 38·7%
47·7% 39·4%
44·6% 37·8%
68·6% 51·5% 46·1%
70·7% 53·3% 47·6%
65·9% 49·1% 44·2%
40·6% 36·6%
37·8% 33·9%
43·9% 39·7%
63·5% 45·2% 41·9%
63·1% 43·3% 40·1%
63·9% 47·5% 44·0%
38·4% 34·8%
40·4% 36·9%
36·2%
32·5%
*Other cancers: all other malignant tumours are combined; they also include laryngeal cancer in women and breast cancer in men.
Table 2: 40-year trends in the index of net survival for all cancers combined at 1, 5, and 10 years after diagnosis in adults (15–99 years) in England from 1971 to 2011 and trends in the
age-adjusted net survival for 21 selected cancers in England from 1971 to 2011 by sex
of the small number of patients, and broader age
groups were constructed to re-estimate survival for
those combinations.
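The age gap defined in the statistical analysis, and its absolute change since 1971, can be stated as a pair of formulas. The symbols are assumptions introduced for illustration: S_a(p, t) is net survival at t years for age group a in diagnosis period p, and the baseline is taken to be the 1971–72 period. The sign convention (oldest minus youngest) is one reading of the text; it is the reading under which a negative change corresponds to a wider gap, since survival for the oldest patients is the lower of the two.

```latex
% Sketch only: symbols G and \Delta are illustrative, not the article's notation.
\[
  G(p,t) \;=\; S_{75\text{--}99}(p,t) \;-\; S_{15\text{--}44}(p,t),
  \qquad
  \Delta(p,t) \;=\; G(p,t) \;-\; G(1971\text{--}72,\,t).
\]
```

Under this convention G is typically negative, and a negative value of the change means G has moved further from zero, that is, the gap has widened.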
Role of the funding source
The funder had no role in study design, quality control,
analysis, interpretation of the results, drafting, or the
decision to submit for publication. The corresponding
author had full access to all data and was responsible for
the decision to publish.
Results
The index of net survival for all cancers combined at 1, 5,
and 10 years since diagnosis increased substantially
between 1971 and 2011 in England and Wales (figure 1,
tables 2 and 3). The all-cancers survival index was 50% at
1 year after diagnosis for patients diagnosed in 1971–72.
For patients diagnosed during 2005–06, the index was
50% at 5 years after diagnosis, and for patients diagnosed
during 2010–11, we predict that the all-cancers survival
index will reach 50% at 10 years after diagnosis.
For patients diagnosed during 2010–11, the survival
index for all cancers combined had reached 69–70% at
1 year and a predicted value of 54% at 5 years for both sexes
combined. The 5-year survival index rose by 24% (from
30% to 54%) and the 10-year survival index by 26% (from
24% to 50%) between the periods 1971–72 and 2010–11.
Most of the increase occurred between 1990 and 2011.
The survival index for all cancers combined is
on average 10% higher for women than for men at each
time interval since diagnosis. The pattern of increase in
the index was fairly similar for both men and women
during the whole period, although the increase was
linear for women but it became steeper for men after
1990–91. For patients diagnosed during 2010–11, the all-
cancers survival index for women in England was 74% at
1 year, 59% at 5 years, and 54% at 10 years, whereas the
figures for men were 67% at 1 year, 49% at 5 years, and
1971–72
1980–81
1990–91
2000–01
2005–06
2010–11 (prediction)
1 year
5 years
10 years
1 year
5 years
10 years
1 year
5 years
10 years
1 year
5 years
10 years
1 year
5 years
10 years
1 year
5 years
10 years
48·1% 28·9% 23·4%
42·9% 24·8% 20·4%
53·2% 32·9% 26·3%
53·6% 34·7%
48·2% 28·8%
58·9% 40·5%
28·9%
23·7%
34·1%
58·4% 40·4% 34·6%
53·2% 33·9% 28·1%
63·4% 46·8% 41·1%
63·2% 46·9% 41·6%
59·1% 41·5% 35·9%
67·2% 52·2% 47·2%
66·3% 50·6% 46·0%
62·7% 45·7% 41·0%
69·9% 55·5% 51·0%
69·4% 54·2%
65·9% 49·2%
72·8% 59·0%
50·2%
45·5%
54·8%
9·5%
8·7%
10·8%
15·2%
15·3%
15·0%
16·9%
17·9%
15·2%
All cancers combined
All patients
Men
Women
Oesophagus
All patients
Men
Women
Stomach
All patients
Men
Women
Colon
All patients
Men
Women
Rectum
All patients
Men
Women
Pancreas
All patients
Men
Women
Larynx
Men
Lung
All patients
Men
Women
Melanoma of skin
All patients
Men
Women
Breast
Women
Cervix
Women
Uterus
Women
Ovary
Women
Prostate
Men
Testis
Men
Kidney
All patients
Men
Women
15·6%
14·6%
17·4%
5·2%
5·1%
5·4%
5·7%
5·6%
6·0%
4·1%
3·8%
4·8%
4·6%
4·5%
4·9%
42·7% 25·0% 22·8%
43·1% 26·5% 24·5%
42·2% 23·4% 21·2%
50·8% 22·9% 19·7%
50·6% 21·4% 17·9%
51·0% 25·2% 22·1%
18·7%
19·1%
18·1%
6·0%
5·8%
6·4%
10·1%
21·3%
9·7%
21·0%
21·9% 10·8%
51·8% 33·3%
51·9% 33·3%
51·8% 33·3%
58·5%
31·2%
58·7% 29·9%
58·1%
33·1%
12·2%
11·5%
12·9%
3·8%
4·0%
3·7%
2·4%
2·7%
2·1%
12·8%
13·0%
12·5%
4·6%
5·6%
3·7%
5·2%
4·9%
5·6%
8·9%
8·6%
9·4%
30·9%
30·9%
31·0%
27·7%
26·1%
30·0%
3·4%
4·6%
2·3%
22·8%
23·2%
22·1%
6·9%
7·0%
6·8%
24·7% 10·8%
25·0% 10·6%
11·2%
24·1%
5·8%
6·0%
5·5%
9·2%
9·1%
9·3%
30·7%
32·8%
27·4%
8·8%
9·3%
7·9%
6·7%
7·1%
6·1%
35·5% 10·6%
37·7% 10·9%
32·1% 10·3%
7·9%
7·8%
8·0%
39·7%
42·3%
35·8%
12·9%
12·7%
13·3%
30·9% 12·6%
32·3%
12·6%
12·7%
28·5%
9·9%
9·9%
10·1%
36·5% 15·5% 12·0%
38·2% 15·5%
11·7%
33·5% 15·6% 12·4%
19·5%
19·4%
14·9%
43·1%
14·4%
45·0%
39·6% 19·6% 16·0%
58·4% 39·8% 37·2%
60·0% 40·3% 37·4%
56·9% 39·2% 37·0%
63·2% 45·2% 42·4%
65·8% 46·6% 43·3%
60·5% 43·7% 41·6%
67·8% 50·9% 48·3%
70·2% 51·8% 48·5%
65·4% 49·9% 48·0%
73·0%
74·9%
71·1%
57·7%
55·4%
57·9% 54·9%
57·5%
55·8%
65·7% 40·6% 37·1%
66·5% 39·8% 35·9%
64·6% 41·7% 38·9%
72·4% 50·0% 46·7%
73·2% 49·5% 45·8%
71·4% 50·8% 48·0%
75·2% 54·4% 51·3%
76·1% 54·1% 50·6%
74·0% 54·8% 52·3%
55·6%
77·7%
58·5%
78·6% 58·4%
55·1%
76·4% 58·6% 56·4%
12·9%
13·5%
12·4%
4·2%
5·0%
3·3%
2·8%
3·7%
2·0%
14·0%
14·8%
13·3%
3·0%
3·4%
2·6%
1·5%
1·8%
1·3%
16·3%
16·7%
15·8%
3·0%
3·4%
2·7%
1·3%
1·5%
1·2%
19·0%
19·4%
18·6%
3·3%
3·7%
2·9%
1·2%
1·4%
1·1%
77·7% 56·3% 45·9%
82·5% 64·8%
55·6%
82·1% 63·9% 54·5%
80·2% 60·4% 50·4%
81·4% 63·3% 53·7%
84·0% 68·1%
59·5%
5·1%
4·2%
6·6%
3·6%
2·8%
5·1%
18·7%
18·6%
18·8%
7·2%
7·2%
7·0%
5·5%
5·6%
5·3%
19·7%
19·5%
20·1%
6·8%
6·7%
6·9%
4·7%
4·6%
4·9%
21·5%
21·1%
22·2%
5·9%
5·5%
6·6%
3·3%
2·9%
4·0%
25·5%
24·4%
27·4%
6·9%
6·3%
8·0%
3·6%
3·1%
4·3%
31·1%
28·8%
35·2%
8·6%
7·7%
10·3%
4·2%
3·7%
5·1%
79·9% 51·1% 44·0%
73·8% 38·9% 33·3%
84·4% 60·1% 52·0%
74·9% 47·9% 34·8%
73·9% 52·8% 47·4%
72·7% 55·9% 53·4%
48·2% 22·2% 18·0%
62·7% 36·6% 27·8%
82·9% 69·5% 66·2%
43·7% 29·0% 24·4%
44·8% 30·6% 25·3%
41·9% 26·4% 22·9%
82·3% 63·1%
76·6% 51·0%
86·5%
57·2%
44·6%
72·0% 66·6%
85·6% 71·4% 66·3%
81·8% 62·5% 55·9%
88·3% 78·0% 73·9%
77·5%
91·3%
72·9%
89·4% 71·0% 65·8%
92·7% 82·2% 78·1%
94·4% 82·4% 77·6%
93·1% 76·4% 68·9%
95·3% 86·7% 84·1%
96·8% 89·0% 82·1%
95·8% 83·7% 68·3%
97·6% 92·9% 92·2%
81·8% 60·3%
48·5%
87·4% 71·7% 62·3%
91·4% 80·4% 73·4%
93·0% 83·8% 77·9%
94·3% 86·7%
81·8%
80·0% 63·2%
57·8%
78·6% 59·9% 55·0%
78·5% 59·9% 55·2%
79·7% 62·4% 57·5%
81·7%
65·6% 60·3%
76·2% 61·7%
56·8%
80·6% 67·0% 62·2%
85·3%
72·4% 69·6%
88·1% 76·8% 73·9%
90·5%
81·2%
77·8%
52·0% 26·2%
21·8%
56·9% 31·4% 26·6%
61·1% 36·6% 31·8%
63·1% 39·2% 34·4%
65·1%
41·9%
37·1%
65·9% 35·9%
25·6%
72·9% 44·6% 32·9%
85·0% 68·8% 59·1%
90·1% 79·8% 74·9%
93·7%
87·1%
87·1%
89·9% 81·1%
80·0%
94·4% 89·7% 89·1%
97·1% 95·0% 94·1%
97·4% 96·0% 94·4%
97·4% 96·6% 93·9%
46·9% 31·0%
48·5%
32·0%
44·2% 29·5%
25·1%
25·6%
24·4%
53·0% 36·5% 29·7%
54·6% 37·2% 30·0%
50·3% 35·4% 29·2%
61·6% 46·2% 39·5%
62·3% 46·9% 39·9%
60·5% 45·1% 38·7%
66·6% 51·2% 44·0%
67·6% 51·4% 43·5%
64·8% 50·8% 44·8%
70·8%
72·2%
68·5%
55·2%
47·3%
53·9% 44·2%
57·3%
52·4%
*Other cancers: all other malignant tumours are combined; they also include laryngeal cancer in women and breast cancer in men
Table 3: 40-year trends in the index of net survival for all cancers combined at 1, 5, and 10 years after diagnosis in adults (15–99 years) in Wales from 1971 to 2011 and trends in the
age-adjusted net survival for 21 selected cancers in Wales from 1971 to 2011 by sex
46% at 10 years. Both the levels and the trends in the all-
cancers survival index were similar in England and
Wales. The average absolute difference between the two
countries was less than 1% (figure 1, tables 2 and 3).
Survival for both sexes combined varied widely for
different cancers, with the most recent predicted 10-year
net survival adjusted for age and sex ranging from only
1·1% for pancreatic cancer to 98·2% for testicular cancer.
A scatter-plot of the 1-year, 5-year, and 10-year survival
estimates for adults diagnosed in 2010–11 against the
absolute change since 1971 enables three broad clusters
of cancers to be identified (figure 2). The first cluster
consists of cancers with high survival in 2010–11 for
which the absolute increase in survival since 1971–72 is
progressively larger for survival at 1, 5, and 10 years. It
includes cancers of the breast, prostate, testis, and
uterus, and melanoma and Hodgkin’s disease.
The second cluster is of cancers with a moderate level of
survival (64–84%) in 2010–11 and, generally, smaller
increases since 1971–72. This cluster consists of cancers of
the larynx, cervix, rectum, colon, bladder, ovary, and
kidney, with non-Hodgkin lymphoma, multiple myeloma,
and leukaemia. For multiple myeloma and leukaemia,
age-adjusted 10-year survival rose by more than 22%
between the periods 1990–91 and 2010–11, from around
10·8% to a predicted 32·6% for multiple myeloma and
from 24·0% to 46·1% for leukaemia (table 2).
The third cluster is of cancers for which survival for
patients diagnosed during 2010–11 is still low, and for
which little or no improvement has occurred in the past
40 years: this group consists of malignancies of the brain,
stomach, lung, oesophagus, and pancreas.
This clustering can be seen as early as 1 year after
diagnosis, and each cancer is in the same cluster,
irrespective of the time since diagnosis (and the nation).
We observed the largest absolute change in the age-
adjusted survival for multiple myeloma, leukaemia, and
prostate cancer.
[Figure 2 (scatter-plot; caption not recovered). Panels: England, Wales. x-axis: Absolute change since 1971 (%). Points are labelled by cancer and grouped into clusters.]
1-year survival from lung cancer has improved
substantially, from 16% in 1971–72 to 32% in 2010–11.
However, estimated long-term survival for patients
diagnosed in 2010–11 is very poor for both sexes: as low
as 10% at 5 years and 4% and 7% in men and women,
respectively, at 10 years. This overall pattern of no
improvement in long-term survival is common in the
cluster of poor-prognosis cancers (oesophagus, stomach,
pancreas, and brain), for men and women and for both
England and Wales.
Survival for breast cancer has seen a rapid and
substantial improvement during the past 40 years. 5-year
survival increased from 53% in 1971–72 to a predicted
value of 87% in 2010–11. After 10 years, survival rose from
40% in 1971–72 to a predicted 78% for patients diagnosed
during 2010–11. Diff erences between 5-year and 10-year
survival estimates remained broadly constant since 1971,
showing that most of the improvements in long-term
survival arose in the fi rst 5 years after diagnosis. Breast
cancer accounted for nearly a third of all cancers in
women, which partly explains the higher all-cancers
survival index in women than in men.
Although survival from cancers of the colon and
rectum is much lower than survival from breast cancer
(around 20% lower in 2010–11), the trends in 1-year,
5-year, and 10-year survival for these two cancers have
followed an almost identical pattern to that of breast
cancer during the past 40 years.
For men diagnosed with prostate cancer during
2010–11, the predicted values for 5-year and 10-year
estimates are almost
identical at 85% and 84%,
respectively, which are huge increases from the values of
37% and 25% for men diagnosed 40 years ago. The trends
are quite distinct for short-term, medium-term, and
long-term survival. In both England and Wales, 1-year
survival has been increasing since 1971–72, whereas
acceleration in 5-year survival started for men diagnosed
in the 1980s; 10-year survival only began increasing for
men diagnosed in the 1990s.
For women diagnosed with cancer of the ovary during
2010–11, the age-adjusted survival was predicted as 46%
at 5 years and 35% at 10 years compared with 20% and
18%, respectively, for women diagnosed during 1971–72.
These results suggest that the underlying increase in
survival of up to 5 years is likely to continue.
Net survival is generally lower for the oldest patients
(75–99 years) than the youngest (15–44 years), even
though net survival accounts for a higher mortality from
causes other than cancer in elderly patients. This fi nding
is shown by a scatter-plot of the age gap in net survival at
1, 5, and 10 years after diagnosis for adults diagnosed in
Figure 2: Net survival adjusted for age and sex for each cancer in 2010–11,
and absolute change* since 1971, all adults (15–99 years), England and
Wales: 1, 5, and 10 years after diagnosis
*The absolute change is the simple arithmetic diff erence between net survival in
2010–11 and the survival in 1971–72. NHL=non-Hodgkin lymphoma.
[Figure 3 plot residue omitted: panels for England and Wales; axis label "Absolute change in age gap since 1971 (%)", annotated "Increase in age gap" and "Decrease in age gap"; points labelled by cancer site.]
2010–11 against the absolute change since 1971–72: it
shows a negative gap in survival for most cancers (y-axis
of fi gures 3 and 4).
The largest age gaps in survival in men were observed
for cancers for which high-dose chemotherapy is the key
treatment (lymphoma, multiple myeloma, and leukaemia),
but we could not identify any overall temporal patterns.
For women, the largest age gaps were noted for brain
tumours, and cancers of the ovary and cervix, and multiple
myeloma, but the clustering was less obvious than in men.
The age gap tended to narrow for melanoma and cancer of
the uterus in women but widened for long-term survival of
ovarian cancer.
Discussion
The index of net survival for all cancers combined has
increased substantially: for patients diagnosed in 1971–72,
the index was 50% at 1 year after diagnosis. Our
prediction is that, for patients diagnosed during 2010–11,
the all-cancers survival index will reach 50% at 10 years
after diagnosis. Very similar patterns of change and
levels of survival were noted in both England and Wales.
Survival has increased steadily during the 40 years
since 1971, with a slight acceleration in the past
10–15 years, particularly for 5-year and 10-year survival, in
both England and Wales. After implementation of the
NHS cancer plan for England,19 we reported a slight
acceleration in the 1-year cancer survival trends during
2004–06, by contrast with Wales,2 where a national cancer
plan was only introduced in 2006. The pattern was not so
clear for survival at 3 years after diagnosis. The fi ndings
reported here suggest a continuing acceleration of these
trends for longer-term survival between 2005–06 and
2010–11 in England, but also in Wales (panel).
The completeness and quality of cancer registration
and follow-up data in both England and Wales have been
systematically assessed and are thought to be very high
throughout the period 1971–2011, despite undeniable
improvement during the 1970s–80s.21–23 This improvement
cannot explain long-term trends in cancer survival.24,25
Furthermore, with the exception of bladder cancer, overall
changes in disease defi nitions are limited, even for
haemopoietic malignancies. To aff ect the survival index,
such a change in disease defi nition would need to aff ect a
substantial proportion of all cancers, for which prognosis
would also need to be very diff erent from that for other
cancers. These conditions are not met.
In some strata defi ned by age, sex, cancer, and calendar
period of diagnosis, especially in Wales, few deaths
Figure 3: Age gap* in net survival by cancer, men (15–99 years) diagnosed
during 2010–11 versus absolute change† in the age gap since 1971, England
and Wales: 1, 5, and 10 years after diagnosis
*The age gap represents the absolute diff erence (%) between net survival in the
oldest (75–99 years) and youngest (15–45 years) groups of patients; a negative
value means that survival is lower in the oldest group than the youngest group.
†The absolute change is the simple arithmetic diff erence between the age gap in
2010–11 and the age gap in 1971–72. NHL=non-Hodgkin lymphoma.
[Figure 4 plot residue omitted: panels for England and Wales; axis label "Absolute change in age gap since 1971 (%)", annotated "Increase in age gap" and "Decrease in age gap"; points labelled by cancer site.]
occurred. To obtain more stable net survival estimates, we
therefore estimated net survival using a modelling
approach rather than the non-parametric Pohar-Perme
approach.20
The index of net survival for all cancers combined
provides one convenient number that summarises the
overall patterns of cancer survival in any one population
or country, in each calendar period for young and old men
and women and for a wide range of cancers with very
disparate survival. The index is unaff ected by changes in
the proportion of cancers of diff erent lethality in either
sex, such as the reduction of lung cancer or the increase in
prostate cancer in men. Similarly, the index is unaff ected
by ageing of the population of patients with cancer or
shifts in the proportion of any cancer between men and
women. The value of the index changes only when survival
for one or more cancers changes, for one or more age
groups. The index therefore shows overall progress in
cancer management, whether from earlier diagnosis, or
earlier stage of disease, or improved treatment and care.
However, the all-cancers survival index needs careful
interpretation: for example, the predicted value of 50% for
the 10-year all-cancers survival index for 2010–11 does not
mean that half of all patients will be cured or “beat cancer”,
as has been portrayed in the media.26 The index is designed
as a public health measure that summarises cancer
survival trends in an entire population, to help to assess
progress in the overall eff ectiveness of the health system in
diagnosis and management of patients with cancer. The
index does not refl ect the prospects of survival for any
individual patients with cancer. The index is based on net
survival, which is an unbiased measure of population-
based survival from cancer after adjustment for other
causes of death. Net survival is the most valid available
metric for comparison of survival between populations
and for assessment of progress in cancer survival over
time. The all-cancers net survival index should nevertheless
be interpreted in conjunction with other information
available in the population or country for which the index
has been prepared. It should be seen as a guide to raise
questions about the potential for improvement.
The average 10% diff erence in the survival index
between men and women has been a consistent feature
for 40 years. It arises because, for several individual
cancers, survival is slightly higher for women, but mostly
because the cancers that are most common in women,
such as breast cancer (weight of 0·31 in the survival index
for women), generally have higher survival than the
cancers that are most common in men, such as lung
Figure 4: Age gap* in net survival by cancer, women (15–99 years) diagnosed
during 2010–11 versus absolute change† (%) in the age gap since 1971,
England and Wales: 1, 5, and 10 years after diagnosis
*The age gap represents the absolute diff erence (%) between net survival in the
oldest (75–99 years) and youngest (15–45 years) groups of patients; a negative
value means that survival is lower in the oldest group than the youngest group.
†The absolute change is the simple arithmetic diff erence between the age gap in
2010–11 and the age gap in 1971–72. NHL=non-Hodgkin lymphoma.
See Online for appendix
Panel: Research in context
Systematic review
Health policy measures to improve the organisation and
delivery of services for the prevention, diagnosis, and
treatment of cancer should be based on sound evidence.
Population-based survival trends have proved to be a key
metric for the overall eff ectiveness of health systems. An
unbiased estimator of net survival was introduced in 2012.20
We have not undertaken a literature review, but so far, only a
few countries have published population-based cancer
survival using this estimator, including in England by our
research group.12 No other country has constructed a single,
summary index of net survival for all cancers combined.
A simple, robust, one-number index of net survival for all
cancers combined can contribute to the evidence base for
rational health policy.
Interpretation
Changes in the net survival index refl ect changes in survival
for one or more cancers, not simply changes in the
distribution of cancer patients by age, cancer site, or sex. The
net survival index increased substantially between 1971 and
2011, representing a substantial gain in overall survival from
all cancers combined. Net survival varied widely for diff erent
cancers, and was generally lower for older patients than
younger patients, even after adjustment for the higher
mortality from other causes in older patients. Three clusters
of cancers, with high, moderate, and low survival, can be
distinguished as early as 1 year after diagnosis. Overall, the
survival trends are encouraging in both England and Wales,
but they also suggest strongly the need for renewed eff orts to
achieve better outcomes.
cancer (weight of 0·22 in the index for men). The slight
narrowing in the sex gap observed in the most recent
periods might be explained by the rapid increase in
survival for prostate cancer (weight of 0·19 in the index for
men), particularly at 5 and 10 years after diagnosis. This
rapid increase in survival for prostate cancer has been
largely attributed to the widespread use of prostate-specifi c
antigen (PSA) testing, resulting in the diagnosis of many
less advanced tumours with a shift of the stage distribution
to less advanced and less aggressive disease. However,
importantly, survival had already started to increase, albeit
more slowly, much before PSA testing was widely used.27
The more recent increase in long-term survival suggests
that this improvement is not simply because of a shift in
the stage distribution after increasingly wide use of the
PSA test. The increase in short-term survival, which began
as early as the 1970s, and the increase in 5-year survival in
the 1980s and then in the 10-year survival in the following
decade cannot simply be attributed to PSA.
We were able to group the 21 most common cancers
into three clusters on the basis of their survival. Despite
some large gains in survival, these clusters are, with few
exceptions, the same in 2011 as in 1971 (data not shown).
The clusters are identifi able as early as 1 year after
diagnosis, and they are consistent at 5 and 10 years after
diagnosis, both in England and Wales.
Cluster 1 includes cancers with a good prognosis:
survival is now very high, after a large increase since
1971, particularly at 5 and 10 years after diagnosis. 1-year
survival seems to have reached a ceiling for most of these
cancers, but survival at 5 and 10 years is still much lower
than at 1 year for breast cancer and Hodgkin’s disease.
The absence of any plateau in survival, even 10 years after
diagnosis, shows that cure at the population level has still
not been reached for these cancers, leaving room for
substantial further improvement in long-term survival.
For most cancers in the other two clusters, survival at 5
and 10 years after diagnosis is still much lower than
1-year survival. The second cluster consists of a further
mix of cancers for which either survival has remained
moderate since the early 1970s, or moderate levels of
survival in 2011 are the result of large improvements
during the past 40 years. The second situation is well
illustrated by the steep increase in survival from multiple
myeloma since 2000–01, probably explained by the
introduction of higher-dose treatment regimens around
2000. For the cancers in this cluster that have shown no
evidence of improvement, eff orts should be made to
achieve earlier diagnosis, and to focus on stricter
guidelines for improved treatment, such as increased use
of surgery, radiotherapy with curative intent, neoadjuvant
therapies, or a combination of the three.
The eff ect of mass-screening on survival varies with the
cancer. For cervical cancer, an effi cient screening
programme does not necessarily lead to an improvement
in survival because screening prevents the occurrence of
invasive tumours, thereby reducing incidence, and the
remaining patients are, on average, diagnosed with more
advanced disease.28 A quasi-plateau in 1-year survival has
been observed since 2000–01 (appendix 1 and 2).
By contrast, breast cancer screening aims to diagnose
the disease at an early stage, rather than to prevent it. Its
real eff ect on survival has been questioned mainly because
of possible overdiagnosis and lead time. However,
overdiagnosis does not exceed a few percent,29 and the
advantage in survival remains important for screen-
detected breast cancer after accounting for lead time.30
Improvement in breast cancer survival has been large
because of both early diagnosis and improved treatment,
although net survival continues to decrease even 10 years
after diagnosis, showing late recurrences. The age gap in
survival has also decreased, supporting more rapid
improvement in survival for older women (and for the
screened age group) than in younger women.31
Screening for colorectal cancer, which started in 2006,
aims to prevent invasive malignant tumours (by
removing polyps with adenomatous change) and to
diagnose cancer at an early stage. Therefore, although it
is too recent to have any eff ect on these results, lessons
from both cervical and breast cancer screening
programmes will also help us to monitor the eff ect of
screening on the prognosis of colorectal cancer.
A wide age gap in survival was still present for most
cancers in 2010–11. Some of these diff erences are related
to screening or early diagnostic practices (breast, cervix,
prostate). Also, the disease, and its prognosis, might
radically diff er by age, such as leukaemia: the treatment
of acute disease in young patients improved
substantially, by contrast with chronic leukaemia in
elderly patients, but separation of both diseases is not
possible over the entire period 1971–2011. However, in
other countries, the age gap in cancer survival is much
narrower than in England and Wales.32,33 The wide age-
related inequalities in cancer survival in England and
Wales are thus likely to be avoidable. They could be
substantially reduced.
1-year survival has improved substantially for cancers
with a particularly poor prognosis (cluster 3), but longer-
term survival (5 and 10 years after diagnosis) has hardly
changed during the past four decades. Among these
cancers, substantial improvements should be achievable
for lung cancer: in 2011, National Institute for Health and
Care Excellence (NICE) guidelines34 underlined the need
for improved staging and increased widespread access to
surgery and radiotherapy with curative intent for non-
small-cell lung cancer. Adherence to these guidelines
and their eff ect on cancer outcomes has not yet been
exhaustively assessed.35
In summary, despite impressive overall improvements
in cancer survival during the past 40 years in both
England and Wales, the wide and persistent diff erences
in survival between cancers, together with the wide and
persistent age gap in survival for most cancers, suggest
the need for renewed eff orts to achieve improved
outcomes, particularly in elderly patients. The fi ndings
reported here off er clues for focused research to dissect
the underlying causes of these diff erences in cancer
survival. The results should prompt action to improve
public health in both England and Wales. This research
will need systematic linkage of clinical audit streams
and other detailed data streams to population-based
cancer registry data, but the recent crisis of public
concern about the sharing of individual health data for
confi dential public health research will need to be
resolved fi rst.36
Contributors
MQ did the analysis. MQ and BR designed the analytic strategies
and constructed the indexes. MQ, MPC, and BR wrote the Article
and interpreted the findings.
|
Thanks - that will of course read the ligatures because they are part of your text file. readtext does not convert ligatures if the .txt file contains them. But if you are converting from a pdf file, then we can help. My own tests showed that ligatures are correctly processed. Did you read the text above using readtext() where the source file was a pdf? |
Hello,
Specifying the UTF-8 encoding did help clear up many of the other issues we had when reading in the .txt file. However, I will see if we have access to the pdf files and go from there. I did utilize your work on the example pdf document and noted that it was effective, which may be the route we have to take. Thank you again.
Best,
…____________________________________________________
Keith L. Johnson, Ph. D.
Montana State University
Physics Department, Barnard Hall 226
PER Group
|
One option for you is "Unicode normalization", which will convert the ligatures. (But I am not sure why you have a space above after each ligature - this would need to be dealt with separately, if it's actually part of your file.)
txt <- "substantial investments in human and ﬁnancial
resources for the eﬀect of earlier diagnosis of waﬄes."
cat(stringi::stri_trans_nfkc(txt))
## substantial investments in human and financial
## resources for the effect of earlier diagnosis of waffles. |
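The stray space after each ligature, mentioned in the comment above, could be handled before normalization. The following is a minimal sketch and not part of readtext: it assumes the file really contains a ligature character (U+FB00 to U+FB06) followed by one space, and the sample string is hypothetical.

```r
library(stringi)

# Hypothetical input: ligature characters followed by a stray space,
# mimicking the pattern seen in the attached text file.
txt <- "the \ufb01 rst cluster of di\ufb00 erent cancers"

# Step 1: drop a single space that directly follows a ligature (U+FB00-U+FB06).
# Caveat: this also joins a word that legitimately ends in a ligature
# to the following word, so spot-check the result on real data.
joined <- stri_replace_all_regex(txt, "([\\uFB00-\\uFB06]) ", "$1")

# Step 2: NFKC normalization expands each ligature into plain letters.
clean <- stri_trans_nfkc(joined)
print(clean)
```

The order matters: once NFKC has expanded the ligatures, the stray spaces can no longer be told apart from ordinary word boundaries.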
Dr. Benoit
Since last we communicated I have come across some pdf files (one example attached) which also seem to contain ligatures pertaining to "fi", "fl", etc. (as well as symbols such as lambda which do not concern us). I have again attached all the relevant data that you requested before. If you would like me to open up a separate case on github, let me know. Thank you for all your insight!
The sessionInfo() data:
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] readtext_0.71 quanteda_1.3.14
loaded via a namespace (and not attached):
[1] Rcpp_0.12.16 magrittr_1.5 stopwords_0.9.0 munsell_0.4.3 colorspace_1.3-2 lattice_0.20-35 R6_2.2.2 rlang_0.2.0
[9] fastmatch_1.1-0 stringr_1.3.0 httr_1.3.1 plyr_1.8.4 pdftools_2.1 tools_3.4.4 grid_3.4.4 data.table_1.10.4-3
[17] gtable_0.2.0 spacyr_1.0 yaml_2.1.18 lazyeval_0.2.1 RcppParallel_4.4.2 tibble_1.4.2 Matrix_1.2-12 ggplot2_2.2.1
[25] ISOcodes_2018.06.29 stringi_1.1.7 compiler_3.4.4 pillar_1.2.1 scales_0.5.0 lubridate_1.7.4
The code I utilized (as it pertains to just the attached file):
library(quanteda)
library(readtext)
tempsci <- readtext("hep-th9910196.pdf")
sciCorp <- corpus(tempsci)
doc_term_matrix <- dfm(sciCorp,remove = stopwords("english"),remove_punct = TRUE,remove_numbers=TRUE,stem = FALSE)
FinalSciCorp = textstat_frequency(doc_term_matrix)
Best,
…____________________________________________________
Keith L. Johnson, Ph. D.
Montana State University
Physics Department, Barnard Hall 226
PER Group
|
Thanks but please either upload them by dragging into the GitHub browser, or send them to me by email. (They did not show up above.) |
Thanks. That definitely contains the ligatures, and
library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.0
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
rtxt <- readtext::readtext("~/Downloads/hep-th9910196.pdf")
texts(rtxt) %>%
kwic(c("in*nity", "*erential", "coe*cients", "*uctuation", "cuto*"), 1)
##
## [hep-th9910196.pdf, 1523] the | infinity | so
## [hep-th9910196.pdf, 1614] the | fluctuation | (
## [hep-th9910196.pdf, 1812] the | differential | equation
## [hep-th9910196.pdf, 1826] the | coefficients | of
## [hep-th9910196.pdf, 1987] the | infinity | ,
## [hep-th9910196.pdf, 2928] the | infinity | r
## [hep-th9910196.pdf, 3070] the | infinity | ,
## [hep-th9910196.pdf, 3760] are | cutoff | for
texts(rtxt) %>%
stringi::stri_trans_nfkc() %>%
kwic(c("in*nity", "*erential", "coe*cients", "*uctuation", "cuto*"), 1)
##
## [text1, 1523] the | infinity | so
## [text1, 1614] the | fluctuation | (
## [text1, 1812] the | differential | equation
## [text1, 1826] the | coefficients | of
## [text1, 1987] the | infinity | ,
## [text1, 2928] the | infinity | r
## [text1, 3070] the | infinity | ,
## [text1, 3760] are | cutoff | for |
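The practical consequence for word frequencies can be illustrated without the pdf. This is only a sketch under stated assumptions: the two sample sentences and the helper `count_words()` are hypothetical, and stringi's word extraction stands in for the quanteda tokenizer used in this thread.

```r
library(stringi)

# Hypothetical sample; "\ufb01" is the single-character "fi" ligature (U+FB01).
txt <- c("The \ufb01eld equations", "the field is signi\ufb01cant")

# Hypothetical helper: count lower-cased word types in a character vector.
count_words <- function(x) {
  words <- unlist(stri_extract_all_words(stri_trans_tolower(x)))
  table(words)
}

raw_counts  <- count_words(txt)                   # ligature form counted separately
nfkc_counts <- count_words(stri_trans_nfkc(txt))  # ligature form merged into "field"

print(raw_counts)
print(nfkc_counts)
```

Without normalization the ligature spelling is a separate word type; after `stri_trans_nfkc()` both spellings fall under "field".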
After reading in a corpus of scientific work in the following manner, we eventually get an object back out (FinalSciCorp) which contains all unique words within the corpus, along with some other information:
However, FinalSciCorp still contains some words with ligatures such as "ff" and "fi", among others. As an example, FinalSciCorp contains both the words "field" and "<U+FB01>eld", or in another case just the word "signi<U+FB01>cant". The 'encoding' and 'stri_enc_detect' functions both indicate that the files are likely "UTF-8" although we have also tried many other options, including "latin1" for encoding.
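A short check may help explain why changing the encoding argument does not remove these entries: U+FB01 is a legitimate Unicode character, so a file containing it is still valid UTF-8. The sketch below uses a hypothetical vocabulary vector in place of the feature names of FinalSciCorp.

```r
library(stringi)

# Hypothetical vocabulary, standing in for the words listed in FinalSciCorp.
vocab <- c("field", "\ufb01eld", "signi\ufb01cant", "survival")

# The ligature is an ordinary code point, so the string is valid UTF-8.
is_valid <- stri_enc_isutf8("\ufb01eld")
code_point <- utf8ToInt("\ufb01")  # decimal value of U+FB01

# Flag entries that contain any Latin ligature (U+FB00 to U+FB06).
has_ligature <- stri_detect_regex(vocab, "[\\uFB00-\\uFB06]")
print(vocab[has_ligature])
```

Entries flagged this way are candidates for `stri_trans_nfkc()` rather than for a different encoding setting.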