Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: The parameter date_format in read_csv #58463

Open
1 of 3 tasks
BdR76 opened this issue Apr 28, 2024 · 1 comment
Open
1 of 3 tasks

ENH: The parameter date_format in read_csv #58463

BdR76 opened this issue Apr 28, 2024 · 1 comment
Labels
Enhancement Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@BdR76
Copy link

BdR76 commented Apr 28, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

The read_csv has a parameter date_format which can can be "str or dict of columns", see documentation.

So for parsing date columns you can either use:

  • str = all columns have the same datetime format
  • dict = explicitly specify the datetime format for every individual datetime column

However, in practise a csv file with different datetime formats, usually it's just either a date or a datetime (or time) format. In other words the date format can differ a lot for different csv files, but usually it doesn't differ that much within one file. Theoretically there could be US and European date formats mixed in one csv file, but I work with a lot of csv data and I've never seen this. From my expericence this is a very uncommon use-case.

Feature Description

So for example, a csv file can have 10 date columns formatted like 01-05-2024 and 5 columns formatted like 05-05-2024 12:30.
Reading such a csv file with read_csv with many datetime columns, just the str parameter is not sufficient but the dict parameter is a bit overkill because you have to explicitly set the format for each column when basically there are just two groups, so it's not very practical.

So my feature request is:

Can the read_csv be updated so that the date_format parameter also accepts just a list of dateformat strings for the date columns? So for example date_format=['%d-%m-%Y', '%d-%m-%Y %H:%M:%S']

Alternative Solutions

Alternatively, I think it could be practical for most typical use-cases to give groups of dateformats. So instead of having to supply a parameter for each individual column, like this:

date_formats = {'incdate': '%d-%m-%Y %H:%M:%S',
             'dob': '%d-%m-%Y',
             'visitdate': '%d-%m-%Y %H:%M:%S',
             'ther_startdat1': '%d-%m-%Y',
             'ther_startdat2': '%d-%m-%Y',
             'ther_startdat3': '%d-%m-%Y',
             'ther_startdat4': '%d-%m-%Y',
             'ther_startdat5': '%d-%m-%Y',
             'ther_uptake1': '%H:%M:%S',
             'ther_uptake2': '%H:%M:%S',
             'ther_uptake3': '%H:%M:%S',
             'ther_uptake4': '%H:%M:%S',
             'ther_uptake5': '%H:%M:%S'}

It could be changed so you have to supply groups like this, which is less code and more reflecting the actual situation:

date_formats = {'%d-%m-%Y %H:%M:%S': ['incdate', 'visitdate'],
             '%d-%m-%Y': ['dob', 'ther_startdat1', 'ther_startdat2', 'ther_startdat3', 'ther_startdat4', 'ther_startdat5'],
             '%H:%M:%S': ['dob', 'ther_uptake1', 'ther_uptake2', 'ther_uptake3', 'ther_uptake4', 'ther_uptake5']
}

Additional Context

See code examples below for typical csv files with date values (it is all randomly generated test data)

# Library
import io
import pandas as pd

# load data hard coded, for eaxmple
str_cardio = """TestDate;Protocol;CustomId;SubjectId;BirthDate;Gender;TestStage;Duration;AvgHeartRate;AvgLoad
22-4-2024 17:31:13;Long HIIT 55W 45 sec.;;75110500;26-1-1970;Male;Warmup;151;85,1;26,5
22-4-2024 17:31:13;Long HIIT 55W 45 sec.;;75110500;26-1-1970;Male;Training;1352;113,8;52,6
22-4-2024 17:31:13;Long HIIT 55W 45 sec.;;75110500;26-1-1970;Male;Recovery;256;86,2;5,4
19-4-2024 16:17:32;Long HIIT 130W 45 sec.;;85775845;19-4-2024;Male;Warmup;110;64,9;64,3
19-4-2024 16:17:32;Long HIIT 130W 45 sec.;;85775845;19-4-2024;Male;Training;2485;144,3;128,4
19-4-2024 16:17:32;Long HIIT 130W 45 sec.;;85775845;19-4-2024;Male;Recovery;265;101,0;12,8
17-4-2024 11:58:03;ONCO Interval 70- 90 (3.00-4.00);;23251624;12-11-1969;Female;Warmup;108;53,2;38,8
17-4-2024 11:58:03;ONCO Interval 70- 90 (3.00-4.00);;23251624;12-11-1969;Female;Training;948;115,0;77,6
17-4-2024 11:58:03;ONCO Interval 70- 90 (3.00-4.00);;23251624;12-11-1969;Female;Recovery;235;74,5;7,8
"""

# no date time hints, TestDate and BirthDate imported as string `object`
#df_cardio = pd.read_csv(io.StringIO(str_cardio), sep = ";", decimal=",")
#print(df_cardio.dtypes)

# datetime columns
col_dates = ['TestDate', 'BirthDate']
date_formats = {'TestDate': '%d-%m-%Y %H:%M:%S',
             'BirthDate': '%d-%m-%Y'}

# read csv file
#df_cardio = pd.read_csv(io.StringIO(str_cardio), sep=';', decimal=',', header=0, parse_dates=col_dates, date_format=date_formats)
#f_cardio = pd.read_csv(io.StringIO(str_cardio), sep=';', decimal=',', header=0, parse_dates=col_dates, date_format=['%d-%m-%Y %H:%M:%S', '%H:%M:%S']) # alternative?
df_cardio = pd.read_csv(io.StringIO(str_cardio), sep=';', decimal=',', header=0, parse_dates=col_dates, dayfirst=True)
print(df_cardio.dtypes)
print(df_cardio.TestDate)
print(df_cardio.BirthDate)

# load data hard coded, for eaxmple
str_therapy = """patid;incdate;dob;sex;visitdate;length;weight;thercode;ther_startdat1;ther_startdat2;ther_startdat3;ther_startdat4;ther_startdat5;ther_uptake1;ther_uptake2;ther_uptake3;ther_uptake4;ther_uptake5
6682;5-5-2023;16-09-1973;Male;5-5-2023 13:30:00;172;51.1;C680;12-6-2023;;;;;0:02:28;;;;
3785;30-3-2024;15-03-1994;Male;30-3-2024 11:15:00;176;;C370;25-4-2024;20-5-2024;;;;0:08:32;0:02:48;;;
9392;28-2-2024;26-01-1983;Male;28-2-2024 15:00:00;187;81.1;A930;1-4-2024;28-5-2024;20-6-2024;;;0:23:53;0:48:59;0:47:47;;
1430;29-7-2023;23-11-1976;Female;29-7-2023 17:15:00;170;61.2;C140;28-8-2023;;;;;0:11:23;;;;
2001;2-12-2022;24-11-1998;Male;2-12-2022 14:00:00;148;62;C200;14-1-2023;13-2-2023;;;;0:33:26;0:27:27;;;
1081;27-3-2023;23-09-1990;Male;27-3-2023 14:30:00;149;73.9;C180;1-4-2023;;;;;0:34:16;;;;
9886;20-2-2023;05-12-1992;Male;20-2-2023 11:30:00;160;83.7;A980;2-3-2023;10-4-2023;10-6-2023;;;0:12:43;0:05:12;0:23:19;;
7291;10-5-2023;01-07-1999;Male;10-5-2023 12:30:00;157;63.7;C720;4-6-2023;;;;;0:58:17;;;;
7851;19-8-2023;09-09-1987;Male;19-8-2023 09:15:00;178;92.9;C780;30-9-2023;13-10-2023;;;;0:02:01;0:38:51;;;
5878;11-4-2023;15-11-1961;Female;11-4-2023 15:00:00;154;81.9;C580;8-5-2023;;;;;0:46:43;;;;
"""

# no date time hints, TestDate and BirthDate imported as string `object`
#df_therapy = pd.read_csv(io.StringIO(str_therapy), sep = ";", decimal=",")
#print(df_therapy.dtypes)

# datetime columns
col_dates = ['incdate', 'dob', 'visitdate', 'ther_startdat1', 'ther_startdat2', 'ther_startdat3', 'ther_startdat4', 'ther_startdat5', 'ther_uptake1', 'ther_uptake2', 'ther_uptake3', 'ther_uptake4', 'ther_uptake5']
date_formats = {'incdate': '%d-%m-%Y %H:%M:%S',
             'dob': '%d-%m-%Y',
             'visitdate': '%d-%m-%Y %H:%M:%S',
             'ther_startdat1': '%d-%m-%Y',
             'ther_startdat2': '%d-%m-%Y',
             'ther_startdat3': '%d-%m-%Y',
             'ther_startdat4': '%d-%m-%Y',
             'ther_startdat5': '%d-%m-%Y',
             'ther_uptake1': '%H:%M:%S',
             'ther_uptake2': '%H:%M:%S',
             'ther_uptake3': '%H:%M:%S',
             'ther_uptake4': '%H:%M:%S',
             'ther_uptake5': '%H:%M:%S'}

# read csv file
df_therapy = pd.read_csv(io.StringIO(str_therapy), sep=';', decimal=',', header=0, parse_dates=col_dates, date_format=date_formats)
#df_therapy = pd.read_csv(io.StringIO(str_therapy), sep=';', decimal=',', header=0, parse_dates=col_dates, date_format=['%d-%m-%Y', '%d-%m-%Y %H:%M:%S', '%H:%M:%S']) # alternative?
#df_therapy = pd.read_csv(io.StringIO(str_therapy), sep=';', decimal=',', header=0, parse_dates=col_dates, dayfirst=True)
print(df_therapy.dtypes)

print(df_therapy.ther_startdat2)
print(df_therapy.ther_uptake2)
@BdR76 BdR76 added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 28, 2024
@Aloqeely
Copy link
Contributor

I am ok with this idea, but I think this can be accomplished in multiple ways without affecting practicality, so I'm not sure if this is necessary.

As for the alternative idea, I'm not really a fan of it, that dict format is not very readable in my opinion, and with just a few lines of code you can convert it to the appropriate dict format.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Needs Triage Issue that has not been reviewed by a pandas team member
Projects
None yet
Development

No branches or pull requests

2 participants