When splitting an empty string in Python, why does split() return an empty list while split('\n') returns ['']?
I am using split('\n') to get the lines in a string, and found that ''.split() returns an empty list, [], while ''.split('\n') returns ['']. Is there any specific reason for such a difference?
And is there a more convenient way to count lines in a string?
Question: I am using split('\n') to get the lines in a string, and found that ''.split() returns an empty list [], while ''.split('\n') returns [''].
The str.split() method has two algorithms. If no arguments are given, it splits on repeated runs of whitespace. However, if an argument is given, it is treated as a single delimiter with no repeated runs.
In the case of splitting an empty string, the first mode (no argument) returns an empty list because the whitespace is eaten and there are no values to put in the result list.
In contrast, the second mode (with an argument such as \n) produces the first empty field. Consider what happens if you write '\n'.split('\n'): you get two fields (one split gives you two halves).
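The two modes are easy to compare side by side on a few edge cases. The following is only an illustrative sketch: the helper name split_both_ways and the sample strings are made up for this example and are not part of Python itself.

```python
def split_both_ways(text, sep):
    """Return (whitespace-mode result, delimiter-mode result) for text."""
    return text.split(), text.split(sep)


for sample in ['', '\n', 'a\nb', 'a\n\nb']:
    no_arg, with_arg = split_both_ways(sample, '\n')
    # Delimiter mode always yields one more field than there are delimiters.
    assert len(with_arg) == sample.count('\n') + 1
    print(repr(sample), no_arg, with_arg)
```

In whitespace mode the empty string and the lone newline both collapse to an empty list, while in delimiter mode they give one and two (empty) fields respectively.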
Question: Is there any specific reason for such a difference?
This first mode is useful when data is aligned in columns with variable amounts of whitespace. For example:
    >>> data = '''\
    shasta      california     14,200
    mckinley    alaska         20,300
    fuji        japan          12,400
    '''
    >>> for line in data.splitlines():
    ...     print(line.split())
    ...
    ['shasta', 'california', '14,200']
    ['mckinley', 'alaska', '20,300']
    ['fuji', 'japan', '12,400']

The second mode is useful for delimited data such as CSV, where repeated commas denote empty fields. For example:
    >>> data = '''\
    guido,bdfl,,amsterdam
    barry,flufl,,usa
    tim,,,usa
    '''
    >>> for line in data.splitlines():
    ...     print(line.split(','))
    ...
    ['guido', 'bdfl', '', 'amsterdam']
    ['barry', 'flufl', '', 'usa']
    ['tim', '', '', 'usa']

Note that the number of result fields is one greater than the number of delimiters. Think of cutting a rope. If you make no cuts, you have one piece. Making one cut gives two pieces. Making two cuts gives three pieces. And so it is with Python's str.split(delimiter) method:
    >>> ''.split(',')       # no cuts
    ['']
    >>> ','.split(',')      # one cut
    ['', '']
    >>> ',,'.split(',')     # two cuts
    ['', '', '']

Question: And is there a more convenient way to count lines in a string?
Yes, there are a couple of easy ways. One uses str.count() and the other uses str.splitlines(). Both ways give the same answer unless the final line is missing the \n. If the final newline is missing, the str.splitlines approach gives the accurate answer. A faster technique that is also accurate uses the count method but then corrects it for the final newline:
    >>> data = '''\
    line 1
    line 2
    line 3
    line 4'''
    >>> data.count('\n')                               # inaccurate
    3
    >>> len(data.splitlines())                         # accurate, but slow
    4
    >>> data.count('\n') + (not data.endswith('\n'))   # accurate and fast
    4

Question from @Kaz: Why the heck are two very different algorithms shoe-horned into a single function?
The signature for str.split is about 20 years old, and a number of the APIs from that era are strictly pragmatic. While not perfect, the method signature isn't "terrible" either. For the most part, Guido's API design choices have stood the test of time.
The current API is not without advantages. Consider strings such as:

    ps_aux_header = "user pid %cpu %mem vsz"
    patient_header = "name,age,height,weight"

When asked to break these strings into fields, people tend to describe both using the same English word, "split". When asked to read code such as fields = line.split() or fields = line.split(','), people tend to correctly interpret the statements as "splits a line into fields".
Microsoft Excel's text-to-columns tool made a similar API choice and incorporates both splitting algorithms in the same tool. People seem to mentally model field-splitting as a single concept even though more than one algorithm is involved.
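A footnote on the line-counting tip earlier in this answer: for an empty string, the count-based expression evaluates to 1, whereas len(''.splitlines()) is 0. If that edge case matters, a small wrapper can handle it. This is a minimal sketch; the name count_lines is hypothetical, and treating the empty string as zero lines is an assumption chosen to match splitlines(). It also only counts '\n', whereas splitlines() recognises other line boundaries such as '\r'.

```python
def count_lines(text):
    """Count '\\n'-terminated lines, tolerating a missing final newline."""
    if not text:
        return 0  # assumption: an empty string contains no lines
    # Each '\n' ends a line; add one more if the last line lacks its '\n'.
    return text.count('\n') + (not text.endswith('\n'))


data = 'line 1\nline 2\nline 3\nline 4'
print(count_lines(data), len(data.splitlines()))
```

For input that uses only '\n' as the line separator, this agrees with len(text.splitlines()) whether or not the text ends with a newline.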