merve-data-report Report

Dataset statistics

Number of variables	3
Number of observations	37
Missing cells	0
Missing cells (%)	0.0%
Duplicate rows	0
Duplicate rows (%)	0.0%
Total size in memory	1016.0 B
Average record size in memory	27.5 B

Variable types

Categorical	3

Alerts

`message` is highly correlated with `name` and 1 other fields	High correlation
`name` is highly correlated with `message` and 1 other fields	High correlation
`time` is highly correlated with `message` and 1 other fields	High correlation
`name` is highly correlated with `message` and 1 other fields	High correlation
`message` is highly correlated with `name` and 1 other fields	High correlation
`time` is highly correlated with `name` and 1 other fields	High correlation
`name` is uniformly distributed	Uniform
`message` is uniformly distributed	Uniform
`time` is uniformly distributed	Uniform
`time` has unique values	Unique

Reproduction

Analysis started	2022-06-09 14:06:23.736523
Analysis finished	2022-06-09 14:06:25.549025
Duration	1.81 seconds
Software version	pandas-profiling v3.2.0
Download configuration	config.json

name
Categorical

HIGH CORRELATION
HIGH CORRELATION
UNIFORM

Distinct	30
Distinct (%)	81.1%
Missing	0
Missing (%)	0.0%
Memory size	424.0 B

dog	3
everything will be gone	2
Slim Shady	2
Charles	2
Chainyo	2
Other values (25)	26

Length

Max length	23
Median length	12
Mean length	8.378378378
Min length	2

Characters and Unicode

Total characters	310
Distinct characters	34
Distinct categories	3
Distinct scripts	2
Distinct blocks	1

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
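As a rough illustration of how such character properties can be tallied, the sketch below counts characters by Unicode general category using Python's standard `unicodedata` module. It is not the report's own code; the helper name `category_counts` is hypothetical, and the input is the five sample rows of `name` listed in this report. The two-letter codes map to the labels used in the tables (`Ll` is Lowercase Letter, `Lu` is Uppercase Letter, `Zs` is Space Separator, `So` is Other Symbol).

```python
import unicodedata
from collections import Counter


def category_counts(values):
    """Count characters by Unicode general category across all strings.

    Hypothetical helper for illustration; not part of pandas-profiling.
    """
    counts = Counter()
    for value in values:
        for ch in value:
            counts[unicodedata.category(ch)] += 1
    return counts


# The five sample rows of the `name` column shown in this report.
names = ["Julien", "Someone else", "A friend", "A friend", "A stranger"]
counts = category_counts(names)

for code, count in counts.most_common():
    share = 100 * count / sum(counts.values())
    print(f"{code}\t{count}\t{share:.1f}%")
```

The same call on a string such as `"🔥"` yields the `So` category, which is why the emoji in `message` is reported under Other Symbol.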

Unique

Unique	24
Unique (%)	64.9%
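The difference between "Distinct" and "Unique" can be sketched as follows, assuming that "Distinct" counts the different values present and "Unique" counts the values that occur exactly once. That reading is consistent with the figures above: 30 distinct values minus the 6 values that repeat leaves 24 unique ones. The helper name `distinct_and_unique` is hypothetical, and the input is the five sample rows of `name`.

```python
from collections import Counter


def distinct_and_unique(values):
    """Return (number of different values, number of values seen exactly once).

    Hypothetical helper; assumes the report's definitions described above.
    """
    counts = Counter(values)
    n_distinct = len(counts)
    n_unique = sum(1 for c in counts.values() if c == 1)
    return n_distinct, n_unique


# The five sample rows of the `name` column shown in this report.
names = ["Julien", "Someone else", "A friend", "A friend", "A stranger"]
n_distinct, n_unique = distinct_and_unique(names)
print(n_distinct, n_unique)

# Percentages for the full column, using the counts reported above (37 rows).
distinct_pct = round(100 * 30 / 37, 1)
unique_pct = round(100 * 24 / 37, 1)
print(distinct_pct, unique_pct)
```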

Sample

1st row	Julien
2nd row	Someone else
3rd row	A friend
4th row	A friend
5th row	A stranger

Common Values

Value	Count	Frequency (%)
dog	3	8.1%
everything will be gone	2	5.4%
Slim Shady	2	5.4%
Charles	2	5.4%
Chainyo	2	5.4%
A friend	2	5.4%
Chris Emezue	1	2.7%
chef boyardee	1	2.7%
aa	1	2.7%
meow	1	2.7%
Other values (20)	20	54.1%

Length

Histogram of lengths of the category

Words

Value	Count	Frequency (%)
dog	3	5.2%
a	3	5.2%
will	2	3.4%
be	2	3.4%
gone	2	3.4%
slim	2	3.4%
shady	2	3.4%
charles	2	3.4%
chainyo	2	3.4%
friend	2	3.4%
Other values (35)	36	62.1%

Most occurring characters

Value	Count	Frequency (%)
e	31	10.0%
i	23	7.4%
a	22	7.1%
(space)	21	6.8%
h	19	6.1%
l	18	5.8%
o	17	5.5%
r	16	5.2%
n	15	4.8%
d	14	4.5%
Other values (24)	114	36.8%

Most occurring categories

Value	Count	Frequency (%)
Lowercase Letter	264	85.2%
Uppercase Letter	25	8.1%
Space Separator	21	6.8%

Most frequent character per category

Lowercase Letter

Value	Count	Frequency (%)
e	31	11.7%
i	23	8.7%
a	22	8.3%
h	19	7.2%
l	18	6.8%
o	17	6.4%
r	16	6.1%
n	15	5.7%
d	14	5.3%
s	10	3.8%
Other values (14)	79	29.9%

Uppercase Letter

Value	Count	Frequency (%)
S	7	28.0%
A	6	24.0%
C	5	20.0%
L	2	8.0%
J	1	4.0%
Y	1	4.0%
N	1	4.0%
E	1	4.0%
K	1	4.0%

Space Separator

Value	Count	Frequency (%)
(space)	21	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Latin	289	93.2%
Common	21	6.8%

Most frequent character per script

Latin

Value	Count	Frequency (%)
e	31	10.7%
i	23	8.0%
a	22	7.6%
h	19	6.6%
l	18	6.2%
o	17	5.9%
r	16	5.5%
n	15	5.2%
d	14	4.8%
s	10	3.5%
Other values (23)	104	36.0%

Common

Value	Count	Frequency (%)
(space)	21	100.0%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	310	100.0%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
e	31	10.0%
i	23	7.4%
a	22	7.1%
(space)	21	6.8%
h	19	6.1%
l	18	5.8%
o	17	5.5%
r	16	5.2%
n	15	4.8%
d	14	4.5%
Other values (24)	114	36.8%

message
Categorical

HIGH CORRELATION
HIGH CORRELATION
UNIFORM

Distinct	36
Distinct (%)	97.3%
Missing	0
Missing (%)	0.0%
Memory size	424.0 B

🔥🔥🔥🔥	2
How are you?	1
Hello everyone	1
The link to have access to the dataset seems to be down	1
i'm good :)	1
Other values (31)	31

Length

Max length	55
Median length	34
Mean length	14.45945946
Min length	2

Characters and Unicode

Total characters	535
Distinct characters	45
Distinct categories	7
Distinct scripts	2
Distinct blocks	2

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique	35
Unique (%)	94.6%

Sample

1st row	How are you?
2nd row	good good
3rd row	🔥🔥🔥🔥
4th row	🔥🔥🔥🔥
5th row	interesting!

Common Values

Value	Count	Frequency (%)
🔥🔥🔥🔥	2	5.4%
How are you?	1	2.7%
Hello everyone	1	2.7%
The link to have access to the dataset seems to be down	1	2.7%
i'm good :)	1	2.7%
I m Lucas	1	2.7%
I love cats	1	2.7%
hello	1	2.7%
I need a text to image like looking glass	1	2.7%
how are you	1	2.7%
Other values (26)	26	70.3%

Length

Histogram of lengths of the category

Words

Value	Count	Frequency (%)
you	6	5.5%
hello	5	4.5%
to	4	3.6%
i	4	3.6%
woof	3	2.7%
good	3	2.7%
are	3	2.7%
the	3	2.7%
love	3	2.7%
my	2	1.8%
Other values (65)	74	67.3%

Most occurring characters

Value	Count	Frequency (%)
(space)	74	13.8%
e	62	11.6%
o	45	8.4%
l	33	6.2%
a	31	5.8%
t	31	5.8%
s	27	5.0%
i	22	4.1%
y	18	3.4%
r	15	2.8%
Other values (35)	177	33.1%

Most occurring categories

Value	Count	Frequency (%)
Lowercase Letter	406	75.9%
Space Separator	74	13.8%
Uppercase Letter	21	3.9%
Other Punctuation	14	2.6%
Decimal Number	11	2.1%
Other Symbol	8	1.5%
Close Punctuation	1	0.2%

Most frequent character per category

Lowercase Letter

Value	Count	Frequency (%)
e	62	15.3%
o	45	11.1%
l	33	8.1%
a	31	7.6%
t	31	7.6%
s	27	6.7%
i	22	5.4%
y	18	4.4%
r	15	3.7%
n	15	3.7%
Other values (13)	107	26.4%

Uppercase Letter

Value	Count	Frequency (%)
H	7	33.3%
I	4	19.0%
S	3	14.3%
T	2	9.5%
N	1	4.8%
M	1	4.8%
G	1	4.8%
L	1	4.8%
W	1	4.8%

Decimal Number

Value	Count	Frequency (%)
2	4	36.4%
1	2	18.2%
4	2	18.2%
3	2	18.2%
9	1	9.1%

Other Punctuation

Value	Count	Frequency (%)
!	4	28.6%
'	4	28.6%
.	3	21.4%
?	2	14.3%
:	1	7.1%

Space Separator

Value	Count	Frequency (%)
(space)	74	100.0%

Other Symbol

Value	Count	Frequency (%)
🔥	8	100.0%

Close Punctuation

Value	Count	Frequency (%)
)	1	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Latin	427	79.8%
Common	108	20.2%

Most frequent character per script

Latin

Value	Count	Frequency (%)
e	62	14.5%
o	45	10.5%
l	33	7.7%
a	31	7.3%
t	31	7.3%
s	27	6.3%
i	22	5.2%
y	18	4.2%
r	15	3.5%
n	15	3.5%
Other values (22)	128	30.0%

Common

Value	Count	Frequency (%)
(space)	74	68.5%
🔥	8	7.4%
2	4	3.7%
!	4	3.7%
'	4	3.7%
.	3	2.8%
1	2	1.9%
4	2	1.9%
?	2	1.9%
3	2	1.9%
Other values (3)	3	2.8%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	527	98.5%
None	8	1.5%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
(space)	74	14.0%
e	62	11.8%
o	45	8.5%
l	33	6.3%
a	31	5.9%
t	31	5.9%
s	27	5.1%
i	22	4.2%
y	18	3.4%
r	15	2.8%
Other values (34)	169	32.1%

None

Value	Count	Frequency (%)
🔥	8	100.0%

time
Categorical

HIGH CORRELATION
HIGH CORRELATION
UNIFORM
UNIQUE

Distinct	37
Distinct (%)	100.0%
Missing	0
Missing (%)	0.0%
Memory size	424.0 B

2021-10-15 19:33:29.506399	1
2021-12-15 18:01:20.248871	1
2021-12-20 07:43:13.477264	1
2021-12-20 07:44:50.373990	1
2022-03-10 12:38:44.469142	1
Other values (32)	32

Length

Max length	26
Median length	26
Mean length	26
Min length	26

Characters and Unicode

Total characters	962
Distinct characters	14
Distinct categories	4
Distinct scripts	1
Distinct blocks	1

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique	37
Unique (%)	100.0%

Sample

1st row	2021-10-15 19:33:29.506399
2nd row	2021-10-15 19:36:21.837263
3rd row	2021-10-15 19:38:08.592406
4th row	2021-10-15 19:38:14.693492
5th row	2021-11-05 20:48:09.082644

Common Values

Value	Count	Frequency (%)
2021-10-15 19:33:29.506399	1	2.7%
2021-12-15 18:01:20.248871	1	2.7%
2021-12-20 07:43:13.477264	1	2.7%
2021-12-20 07:44:50.373990	1	2.7%
2022-03-10 12:38:44.469142	1	2.7%
2022-03-10 13:51:10.874795	1	2.7%
2022-03-10 18:24:27.541837	1	2.7%
2022-03-18 21:32:19.477479	1	2.7%
2022-04-07 12:41:08.938456	1	2.7%
2022-04-07 17:12:20.197251	1	2.7%
Other values (27)	27	73.0%

Length

Histogram of lengths of the category

Words

Value	Count	Frequency (%)
2021-11-05	6	8.1%
2021-10-15	4	5.4%
2021-11-09	4	5.4%
2022-03-10	3	4.1%
2021-11-06	3	4.1%
2022-04-07	3	4.1%
2022-05-10	2	2.7%
2021-12-20	2	2.7%
2022-05-07	2	2.7%
20:48:58.531611	1	1.4%
Other values (44)	44	59.5%

Most occurring characters

Value	Count	Frequency (%)
2	150	15.6%
0	144	15.0%
1	123	12.8%
-	74	7.7%
:	74	7.7%
4	62	6.4%
3	49	5.1%
8	48	5.0%
5	44	4.6%
6	41	4.3%
Other values (4)	153	15.9%

Most occurring categories

Value	Count	Frequency (%)
Decimal Number	740	76.9%
Other Punctuation	111	11.5%
Dash Punctuation	74	7.7%
Space Separator	37	3.8%

Most frequent character per category

Decimal Number

Value	Count	Frequency (%)
2	150	20.3%
0	144	19.5%
1	123	16.6%
4	62	8.4%
3	49	6.6%
8	48	6.5%
5	44	5.9%
6	41	5.5%
9	40	5.4%
7	39	5.3%

Other Punctuation

Value	Count	Frequency (%)
:	74	66.7%
.	37	33.3%

Dash Punctuation

Value	Count	Frequency (%)
-	74	100.0%

Space Separator

Value	Count	Frequency (%)
(space)	37	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Common	962	100.0%

Most frequent character per script

Common

Value	Count	Frequency (%)
2	150	15.6%
0	144	15.0%
1	123	12.8%
-	74	7.7%
:	74	7.7%
4	62	6.4%
3	49	5.1%
8	48	5.0%
5	44	4.6%
6	41	4.3%
Other values (4)	153	15.9%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	962	100.0%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
2	150	15.6%
0	144	15.0%
1	123	12.8%
-	74	7.7%
:	74	7.7%
4	62	6.4%
3	49	5.1%
8	48	5.0%
5	44	4.6%
6	41	4.3%
Other values (4)	153	15.9%

Correlations

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. This report uses the bias-corrected measure proposed by Bergsma in 2013.
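A minimal sketch of a bias-corrected Cramér's V in plain Python is shown below, for illustration only. It follows the commonly cited form of Bergsma's correction, computes the chi-squared statistic without a continuity correction, and is not the report's own implementation; the function name `cramers_v_corrected` and the toy inputs are hypothetical.

```python
import math
from collections import Counter


def cramers_v_corrected(xs, ys):
    """Bias-corrected Cramér's V for two equal-length sequences of labels.

    Illustrative sketch; assumes len(xs) == len(ys) > 1 and applies no
    continuity correction to the chi-squared statistic.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    r, k = len(row_totals), len(col_totals)

    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    # Bias correction: shrink phi^2 and the effective table dimensions.
    phi2 = chi2 / n
    phi2_corr = max(0.0, phi2 - (r - 1) * (k - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    if denom <= 0:
        return 0.0
    return math.sqrt(phi2_corr / denom)


# Toy data: perfectly associated labels, then independent labels.
v_perfect = cramers_v_corrected(["a"] * 10 + ["b"] * 10, ["x"] * 10 + ["y"] * 10)
v_independent = cramers_v_corrected(["a", "a", "b", "b"], ["x", "y", "x", "y"])
print(v_perfect, v_independent)
```

The first call illustrates the upper end of the range (perfect association) and the second the lower end (independence).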

Phik (φk)

Phik (φk) is a practical correlation coefficient that works consistently across categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient for a bivariate normal input distribution. Extensive documentation is available from the Phik project.

Missing values

Count

A simple visualization of nullity by column.

Matrix

The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.
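The idea behind both views can be sketched in plain Python: a nullity matrix is a grid of booleans marking which cells are present, and the count view is the per-column sum of that grid. The helper names below are hypothetical, the first two rows are taken from this report, and the third row (with a missing `message`) is invented purely to show a gap, since the dataset itself has no missing cells.

```python
def nullity_matrix(rows, columns):
    """Return one list of booleans per row: True where the cell is present."""
    return [[row.get(col) is not None for col in columns] for row in rows]


def non_null_counts(matrix, columns):
    """Sum the boolean grid column by column (the 'Count' view)."""
    return {col: sum(line[i] for line in matrix) for i, col in enumerate(columns)}


columns = ["name", "message", "time"]
rows = [
    {"name": "Julien", "message": "How are you?", "time": "2021-10-15 19:33:29.506399"},
    {"name": "Someone else", "message": "good good", "time": "2021-10-15 19:36:21.837263"},
    # Hypothetical row, not in the dataset: `message` is missing.
    {"name": "example", "message": None, "time": "2021-10-15 19:40:00.000000"},
]

matrix = nullity_matrix(rows, columns)
counts = non_null_counts(matrix, columns)
print(counts)
```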

First rows

	name	message	time
0	Julien	How are you?	2021-10-15 19:33:29.506399
1	Someone else	good good	2021-10-15 19:36:21.837263
2	A friend	🔥🔥🔥🔥	2021-10-15 19:38:08.592406
3	A friend	🔥🔥🔥🔥	2021-10-15 19:38:14.693492
4	A stranger	interesting!	2021-11-05 20:48:09.082644
5	Shubham Singh	Hello you are you.	2021-11-05 20:48:42.430647
6	dog	I love dogs	2021-11-05 20:48:58.531611
7	Abubakar Abid	Test	2021-11-05 20:49:10.729872
8	Charles	Hello	2021-11-05 21:59:58.126933
9	Charles	Hello2	2021-11-05 22:00:17.768448

Last rows

	name	message	time
27	micole	I need a text to image like looking glass	2022-04-07 12:41:08.938456
28	Alex	Hello everyone	2022-04-07 17:12:20.197251
29	tomriddle	how are you	2022-04-07 18:22:01.721690
30	meow	great persistence example. cant figure mine out yet.	2022-04-16 00:01:26.027707
31	chef boyardee	have you tried my raviolis?	2022-04-24 03:15:00.496292
32	Chris Emezue	Hello there	2022-04-26 18:16:45.273903
33	aa	test123	2022-05-07 16:42:26.706482
34	bb	test234	2022-05-07 16:42:36.803281
35	dog	woof	2022-05-10 03:58:24.036197
36	dog	woof woof	2022-05-10 03:58:38.850884
