By user347221


2019-04-15 14:37:34 8 Comments

I have a file called data whose contents are

id,col1,col2
0,-0.3479417882673812,0.5664382596767175
1,-0.26800930980980764,0.2952025161991604
2,-0.4159790791116641,-1.3375045524610152
3,-0.7859665489205871,-0.6428101880909471
4,-1.3922759043388822,-1.676262144826317
5,-1.2471867496427498,-0.4912119581361516
6,1.443385383041667,1.6974039491263593
7,-2.058899802821969,2.0607628464079917
8,-0.10641338441541626,0.035929568275064216
9,-0.517273684861199,-0.6184800988804992
10,-0.9934859021679552,1.0577312348984502
11,0.5923834706792905,-0.6693757541250825
12,0.8657741917554445,-0.6876271057571398
13,-1.2061097548360489,-0.7402582563022937
14,0.78768021182158,-0.38607117005262315

Sorting numerically (-n) on the first column gives

$ sort -nk1 -t"," data
0,-0.3479417882673812,0.5664382596767175
id,col1,col2
1,-0.26800930980980764,0.2952025161991604
2,-0.4159790791116641,-1.3375045524610152
3,-0.7859665489205871,-0.6428101880909471
4,-1.3922759043388822,-1.676262144826317
5,-1.2471867496427498,-0.4912119581361516
7,-2.058899802821969,2.0607628464079917
8,-0.10641338441541626,0.035929568275064216
9,-0.517273684861199,-0.6184800988804992
10,-0.9934859021679552,1.0577312348984502
13,-1.2061097548360489,-0.7402582563022937
6,1.443385383041667,1.6974039491263593
11,0.5923834706792905,-0.6693757541250825
12,0.8657741917554445,-0.6876271057571398
14,0.78768021182158,-0.38607117005262315

This is absolutely bizarre to me. I read in the man page that -n is supposed to be numerical sort. Why would id be placed in-between numbers? How is it that 10 is larger than 9, but smaller than 6, all the while 11 being greater than them all?

The -g seems to work as I want (and as I think is natural), but this -n option totally escapes me. What's this about? I think it can be related to locale, but once I specify the delimiter as being ,, I don't think that would explain it.


@Stéphane Chazelas 2019-04-15 14:47:59

TL;DR

Use sort -nk1,1 -t, instead. With -k1 you're sorting on the full line, where , is discarded inside numbers because it's interpreted as a thousand separator.

Details

In English language locales, , is the thousand separator, which sort ignores in the integer part of numbers.

In other words, in an English language locale, or any locale where , is a thousand separator (see the output of locale thousands_sep), when sort -n sees 11,000,000 it doesn't see the 11 number followed by some ignored garbage, but the 11000000 number. Similarly 11,0 is not 11 but 110.
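To check which thousands separator a locale defines, the locale thousands_sep command mentioned above can be run under different LC_ALL settings. This is only a sketch: the en_US.UTF-8 locale name is an assumption and may not be installed on every system.

```shell
# In the C locale no thousands separator is defined, so this is empty.
sep_c=$(LC_ALL=C locale thousands_sep)
printf 'C locale thousands_sep: [%s]\n' "$sep_c"

# Assumption: en_US.UTF-8 is installed. If it is not, the value will
# usually come back empty as well.
sep_en=$(LC_ALL=en_US.UTF-8 locale thousands_sep 2>/dev/null)
printf 'en_US.UTF-8 thousands_sep: [%s]\n' "$sep_en"
```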

Now (and that's something many people trip on), -k1 defines a key that starts with the first field, but as you didn't specify where it stops, it ends at the end of the line. So the sort key is the full line, which is the default.

So sort -nk1 -t, is exactly the same as sort -n.
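A quick way to convince yourself of that equivalence is to run both commands on the same input and compare the results. A minimal sketch using three rows from the question (the variable names are only for illustration):

```shell
# Three sample rows taken from the question's data file.
sample='10,-0.9934859021679552,1.0577312348984502
6,1.443385383041667,1.6974039491263593
11,0.5923834706792905,-0.6693757541250825'

# Key starting at field 1 with no end: the key runs to the end of the line.
with_k1=$(printf '%s\n' "$sample" | sort -n -t, -k1)

# No key at all: the whole line is the key.
without_k=$(printf '%s\n' "$sample" | sort -n)

if [ "$with_k1" = "$without_k" ]; then
  echo "identical output"
else
  echo "outputs differ"
fi
```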

With , ignored as a thousand separator, on your input sort is actually sorting these numbers:

0
1
2
3
4
5
61.4433853830416671
7
8
9
10
110.5923834706792905
120.8657741917554445
13
140.78768021182158

So it's not 6 vs 10 vs 11, but 61.4433853830416671 vs 10 vs 110.5923834706792905.

Here, you want:

sort -nk1,1 -t,

To sort on the first ,-delimited field only. -k1,1 defines a sort key that starts at the start of the first field and ends at the end of the first field.
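As a small illustration, here is that key specification applied to three rows from the question. Because the key stops at the end of field 1, the commas never become part of the number being compared (variable names are only for illustration):

```shell
# Three sample rows taken from the question's data file.
sample='10,-0.9934859021679552,1.0577312348984502
6,1.443385383041667,1.6974039491263593
11,0.5923834706792905,-0.6693757541250825'

# -k1,1 limits the key to the first comma-delimited field;
# the n modifier makes that key numeric.
sorted=$(printf '%s\n' "$sample" | sort -t, -k1,1n)
printf '%s\n' "$sorted"

# First column of the result, one value per line: 6, 10, 11.
ids=$(printf '%s\n' "$sorted" | cut -d, -f1 | tr '\n' ' ')
```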

You can also use sort -n in the C locale where , is neither the decimal radix nor the thousand separator (and . is the decimal radix):

LC_ALL=C sort -n
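A sketch of the same three sample rows sorted on the full line in the C locale. Here the number parsed from each line ends at the first comma, so the rows come out in the order of their first field:

```shell
# Three sample rows taken from the question's data file.
sample='10,-0.9934859021679552,1.0577312348984502
6,1.443385383041667,1.6974039491263593
11,0.5923834706792905,-0.6693757541250825'

# Full-line numeric sort, but with the C locale forced for this command only.
sorted_c=$(printf '%s\n' "$sample" | LC_ALL=C sort -n)
printf '%s\n' "$sorted_c"

ids_c=$(printf '%s\n' "$sorted_c" | cut -d, -f1 | tr '\n' ' ')
```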

sort -g works differently because sort then uses strtold() to interpret the key as a number and strtold() doesn't recognise thousands separators.
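For comparison, a sketch of -g on the same three rows. Note that -g is an extension and not part of POSIX sort, so this assumes an implementation such as GNU sort that provides it:

```shell
# Three sample rows taken from the question's data file.
sample='10,-0.9934859021679552,1.0577312348984502
6,1.443385383041667,1.6974039491263593
11,0.5923834706792905,-0.6693757541250825'

# General numeric sort on the full line; no thousands separators are recognised.
sorted_g=$(printf '%s\n' "$sample" | sort -g)
printf '%s\n' "$sorted_g"

ids_g=$(printf '%s\n' "$sorted_g" | cut -d, -f1 | tr '\n' ' ')
```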

As far as the id header line is concerned, in a numeric comparison, that id... is interpreted as 0 as there's no number to be seen there. It sorts after the line that starts with 0 because when two records sort the same (here with -n in a numeric comparison) sort does a last resort comparison which is a lexical comparison of the full line (and 0 sorts before i).
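The tie-break can be seen in isolation with just the header and the 0 row. A minimal sketch, run in the C locale so the last resort comparison is a plain byte comparison:

```shell
# The header line and the row whose first field is 0.
rows='id,col1,col2
0,-0.3479417882673812,0.5664382596767175'

# Both lines compare as 0 numerically, so the full-line comparison decides.
tie=$(printf '%s\n' "$rows" | LC_ALL=C sort -n)
printf '%s\n' "$tie"

first_line=$(printf '%s\n' "$tie" | head -n 1)
```

Even though the header comes first in the input, the 0 row is printed first.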

With some sort implementations, that last resort comparison can be disabled with -s. Here LC_ALL=C sort -sn would put the id line first, but that's only because there are no negative keys in the input (id (which again numerically is 0) would still sort after -1). If you want to exclude the first line from the sorting, you can do:

(head -n1; LC_ALL=C sort -n) < file
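The command above takes its input from a file. When the data arrives through a pipe instead, head may consume more than the first line, so the sketch below swaps in the shell's read builtin to peel off the header. This is a variant for illustration, not the command from the answer, and it restricts the key to the first field:

```shell
# Header plus three rows from the question, deliberately out of order.
data='id,col1,col2
10,-0.9934859021679552,1.0577312348984502
6,1.443385383041667,1.6974039491263593
0,-0.3479417882673812,0.5664382596767175'

# Read and print the header, then let sort consume the remaining lines
# from the same input.
result=$(printf '%s\n' "$data" | {
  IFS= read -r header
  printf '%s\n' "$header"
  LC_ALL=C sort -t, -k1,1n
})
printf '%s\n' "$result"

first_fields=$(printf '%s\n' "$result" | cut -d, -f1 | tr '\n' ' ')
```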

@user347221 2019-04-15 16:18:05

Thanks. sort -t, -n -k1,1 is not working for me, it's placing 0 above id. Also, does your answer explain why 10 is larger than 9, but smaller than 6, all the while 11 being greater than them all? It's a genuine question, I'm not able to answer this myself from reading your answer.

@Stéphane Chazelas 2019-04-15 16:34:39

@user347221, see if the edit makes it any clearer.

@Barmar 2019-04-15 16:57:56

Why do you talk so much about the thousands separator? There's nothing in the question that suggests that they expected it to be part of the number. They have -t"," to use it as the field delimiter.

@Barmar 2019-04-15 16:59:08

Where is 61.4433853830416671 in the input file? I see 6,1.443385383041667,1.6974039491263593.

@Barmar 2019-04-15 17:00:35

Is the actual problem just that they put -t"," after the key specification instead of before it?

@Stéphane Chazelas 2019-04-15 17:10:50

@Barmar, they have -t, but sort on the full line (-k1, which is superfluous as that's the default) instead of the first field (-k1,1). 6,1.443385383041667 is interpreted by sort -n as 61.4433853830416671 because that , thousand separator is ignored.

@user347221 2019-04-15 17:22:06

@StéphaneChazelas Thank you, this is great. unix.SE is eons beyond other SE sites. People actually help here. I've had great experiences with many of the users here. Gilles, Kusalananda, Stephen, terdon to name a few. Thank you all.

@user347221 2019-04-15 17:23:17

What is the syntax to get it to sort on the second and third columns, then on the fifth and stop here?

@user347221 2019-04-15 21:33:06

For those looking for an answer to my question in the comment above, here it is: unix.stackexchange.com/questions/78925/…
